Title graphics: This old Hippo learns a new trick: a new regulatory mode of Hippo signaling inspires
unconventional cancer therapy.


Hippo-YAP signaling is a key pathway that regulates cell proliferation and organ size, and its
mis-regulation is closely associated with human cancer. The activity of the Hippo pathway is thought
to be controlled at the protein level through phosphorylation and degradation. Recently, Zefeng Wang's
group at PICB and their collaborators found that the Hippo-YAP pathway is also controlled at the RNA
level through alternative splicing (Qi et al., Nat. Comm. 2016). They found that TEAD4, an effector of
the Hippo-YAP pathway, is controlled by an alternative splicing switch that produces a truncated
isoform acting as a dominant negative of the canonical form. The new TEAD4 isoform suppresses the
proliferation and migration of cancer cells and thus inhibits tumor growth in a mouse model.
Consistently, TEAD4 splicing is altered in human cancer patients, and patients with elevated levels of
the new TEAD4 isoform have an improved survival rate, so this splicing switch might be explored as a
new anti-cancer strategy. Alternative splicing is a key mechanism for increasing the coding complexity
of the human genome, and its alteration is a major hallmark of cancer. Splicing mis-regulation of cell
signaling pathways plays critical roles in cancer development and should therefore be explored as a
new route to cancer therapy.


Cartoon designed by Z Wang and Y Wang, art by S.R. Zhang. 

Reference: Qi Y, Yu J, Han W, Fan X, Qian H, Wei H, Tsai YH, Zhao J, Zhang W, Liu Q, Meng S, Wang Y
and Wang Z. A splicing switch of TEAD4 regulates Hippo-YAP signaling pathway to inhibit tumor
proliferation (2016), Nat Comm, 2016:ncomms11840.
1 The CAS-MPG Partner Institute for
Computational Biology


1.1 Establishment 


The CAS-MPG Partner Institute for
Computational Biology (PICB) is a research
institute of the Chinese Academy of Sciences
(CAS). PICB was established by the Max
Planck Society (MPG) of Germany and CAS
in 2005, and is jointly operated by MPG and
CAS. PICB is dedicated to current topics
in the biosciences with a particular
focus on computational biology, with the
overarching research goal of understanding
and interpreting the principles of life in
mathematical language. The institute grew
out of the rich history of scientific
collaboration between MPG and CAS, and
reflects the vision of leading scientists in
China and Germany.


The purpose of this institute is to pursue the 
frontiers of knowledge, to contribute to the 
education and training of excellent junior 
scientists, and to complement the scientific 
research of the CAS and Max Planck Institutes. 
The organization of research at PICB follows
the model of German Max Planck Institutes.
PICB is headed by a Board of Directors
comprised of Institute Directors/Department
heads and, as of recently, other institute
representatives. Directors are selected by
the CAS-MPG Joint Core Commission and
appointed by CAS. The Institute enjoys
complete autonomy with regard to scientific
focus. The scientific research work of PICB
is subject to continuous quality control by
a Scientific Advisory Board that evaluates
scientific developments and the effective use
of resources.


1.2 Major Developments 
(2014-2017) 


Organization of PICB 


During the last review cycle (2012-2014),
the institute decided to adopt a relatively flat
structure of many groups, some of which are
headed by directors, while others are headed
by research group leaders. This structure
differs slightly from that of a typical Max
Planck institute; the inspiration for it came
from the MPI of Molecular Cell Biology and
Genetics in Dresden. This structure was
maintained throughout the period of 2014-2017.


Under the current structure, PI groups are not part
of a department but stand by themselves.
Directors and IRG heads are selected by the
joint CAS-MPG core committee and have their
budget directly allocated by CAS and MPG.
PIs are recruited according to strict scientific
standards following a CAS procedure that
relies on the vote of a committee of CAS
PIs, consisting of PICB directors and PIs and
at least 1/3 PIs external to PICB. PICB is
managed by the board of directors, which has
been extended by representatives of the
IRG heads and the PIs. The current members of
the board of directors are Drs. Zefeng Wang,
Jing-dong Jackie Han, Philipp Khaitovich, Yixue Li,
Shuhua Xu, and Li Yang.


In November 2016, a major administrative
reform took place at the Shanghai Institutes for
Biological Sciences (SIBS), the host
institute of PICB. Under the old model, SIBS
used a “two-layer” administrative and
accounting system, in which PICB had its
own administrative office and accounting
personnel and made its own decisions on
institutional affairs. According to the discussion
between the MPG and CAS leadership at the
Beijing “Strategy ERTC” meeting in November
2016, this reform should not affect PICB.
Under the new structure, the administrative office
of PICB was disbanded, and PICB now relies on
the administrative branches of SIBS for
institutional affairs. However, PICB retains
its scientific autonomy and an independent
account, and the major decisions of the
institute are still made by the board of
directors and various institutional committees.


Establishment of Bio-Med Big Data Center 


The rapid accumulation of big data in life
science and medicine has transformed every
field of biology in recent years. Extensive and
innovative analysis of biomedical big data
is key to the success of almost every large
research initiative in life science, and has also
provided a unique opportunity for the field
of computational biology. To meet the
challenge of big data, PICB has taken this
opportunity to start a scientific
mission to translate big data into novel
knowledge of human biology. In the past
few years, PICB has been advocating for the
construction of a centralized national-level
biomedical big data infrastructure, and we are
very encouraged by the response of the Chinese
government, which recognizes the crucial and
urgent need for this national infrastructure.
In July 2016, the construction of a national
biomedical big data center was formally
approved as one of the national science and
technology infrastructures in the 13th Five-Year
strategic development plan of China. This
infrastructure should achieve very large capacity
for safe data storage, standardized data
processing, systematic data integration across
multiple data types, and in-depth data mining.


To prepare for this national initiative, PICB
established a Bio-Med big data center
within SIBS in January 2016. A new faculty
member, Dr. Yixue Li, was recruited to PICB as
the director of the Bio-Med big data center.
Several faculty members were recruited by PICB
to work exclusively for the Bio-Med big data
center. In addition, we recruited Dr. Guoping
Zhao as an adjunct member of PICB to work as
the leading investigator of the Bio-Med big data
center in PICB. PICB has also obtained additional
space from SIBS to accommodate the Bio-Med big
data center. Our Bio-Med big data center seeks to
serve as a hub for the storage, standardization
and sharing of Bio-Medical big data (such as
multi-omics data), and also to provide various
data analysis services to the entire research
community. This center also serves as a
prototype for the national infrastructure of
Bio-Medical big data, which will be formally
launched after an open call for competitive
proposals.


To facilitate the construction of the national
infrastructure of Bio-Medical big data,
SIBS/PICB signed a contract with the Gui'an
government in June 2017 to build a data
storage/backup node in the national big data
industrial park in Gui'an, Guizhou Province.
The operation of the facility can
reduce energy consumption by 80% per year.
Also, the long-term plan is to build additional
nodes of the infrastructure to form a network
over the entire country to provide regional
service. PICB is running the big data center at
both the Gui'an site and the Shanghai site, each
of which has obtained financial support from its
local government. The Bio-Med big data center has
[...] grant competitions.


In connection with the big data center, PICB has
reorganized its institutional research support
facilities. These research support facilities,
including the information technology facility, the
Uli-Schwarz quantitative biology laboratory, and
the omics/sequencing core, are now integrated as
three core facilities under the big data center and
operated under centralized leadership
(Figure 1). Two additional platforms, the Bio-Med
database core facility and the bioinformatics
service core facility, will also be built under
the Bio-Med big data center. These five core
facilities form the main structure of our Bio-Med
big data center at the Shanghai site.
Figure 1: Organization of Bio-Med big data center in PICB/SIBS.
Changes in the faculty


In September 2015, Zefeng Wang, a
former associate professor at the University of
North Carolina at Chapel Hill, was recruited
as a Director at the institute. He has since
established his own department of RNA
System Biology, described later in this
report. He has served as the managing
director of PICB since October 2016. The newly
recruited PIs are Dr. Yixue Li, arriving in 2015
from the Shanghai Institute of Biochemistry
and Cell Biology, Dr. Guangzhong Wang,
arriving in 2016 from the University of Texas
Southwestern Medical Center, Dr. Guoqing
Zhang, arriving in 2016 from the Shanghai Center
for Bioinformatics, Dr. Wu Wei, arriving in 2017
from Stanford University, and Dr. Qiannan
Hu, arriving in 2017 from the Tianjin Institute of
Industrial Biotechnology. During the reporting
period, Dr. Jun Yan moved to the Institute of
Neuroscience, and the tenures of Drs. Christine
Nardini and Jin Yang at PICB ended.
1.4 Research Concept


PICB was established in 2005 with an initial
emphasis on developing theoretical tools
and mathematical algorithms to model life
science. With the evolution of the computational
biology field, the institute has transformed
from an exclusively theoretical research
institute into a merger of theoretical
and experimental biology through recruiting
new research groups that conduct both “dry”
and “wet” research. With the recent rapid
expansion of biological big data, the field
of computational biology has shifted from
a hypothesis-driven science toward a data-driven
science. This shift of research paradigm
provides both a challenge and a unique
opportunity for PICB, and the innovative use
of big data is now part of the research
of all groups. Therefore, the general research
concept of PICB is to study diverse types
of key biological questions based on
innovative analyses of Bio-Medical big data
across biomolecules, biological systems and
human populations. Currently, the three focal
areas of research at PICB are:


+ Integrative study of gene regulation using 
multi-omics data 


+ Systems modeling and simulation of the
regulation of complex traits


+ Human evolution and adaptation model 
using population genetics 


These three areas deal with biological big
data at the molecular, individual and
population levels, respectively. The groups of
PICB also seek to develop new algorithms and
network modeling approaches to analyze
multi-omics data that include diverse data types
such as transcriptome data, gene regulation
data, epigenetic data, proteomics, and
metabolomics data. Computational biology
today largely deals with these data types,
their analysis, integration, and interpretation.
Therefore, PICB strives to seamlessly integrate
computational and experimental biology
to understand biological processes through
quantitative approaches.


CAS regularly requests institutes to provide
a strategic vision that identifies the potential
area for a main breakthrough and the cultivated
research directions. PICB has identified the
comprehensive analysis of biological big data
as its main research focus and seeks to make
breakthroughs in novel methods and
platforms for data standardization, data sharing,
in-depth data mining and data integration. In
addition, PICB has identified three cultivated
areas where, according to the above
philosophy, we hope to achieve significant
progress in the next five years.
+ Identify an integrative model for the
regulation of gene expression spanning
epigenetic gene regulation, RNA processing
and regulation, protein production and
[illegible]


+ Establish a regulatory model for the aging
phenome, proteome, transcriptome,
epigenome, metabolome, and reconstruct
its regulatory network


+ Identify the adaptation mechanism
underlying human divergence and history


These three directions represent integrative
analysis of multi-omics data, computational
systems biology, and population genetics
and evolution, all of which will flourish on
the foundation of biomedical big data. The
following figure schematically depicts how
the present directors and group leaders are
organized according to these topics:
Figure 2: Research Clusters for major research areas.
Figure 3: Research focus, research breakthrough and cultivated research directions
1.5 Research Progress


With the support of CAS and MPG, and
through the joint efforts of all PICB members,
PICB has made substantial further progress
since the previous evaluation in 2014.
Underpinning this progress has been the
establishment of the Bio-Medical Big Data
Center, which is a new focus for the institute.
In addition, the state-of-the-art computational
and experimental facilities in PICB have
continued to grow and improve to ensure the
international competitiveness of PICB.


PICB’s progress is represented in the following 
aspects: 


+ In terms of talent recruitment,
during 2015-2017 PICB recruited Dr. Zefeng
Wang through the “Hundred Talents
Program” of the “Pioneer Initiative” as a CAS
“academic leader” (category A), and Dr. Wu Wei
as a Group Leader. In addition, PICB recruited
Drs. Yixue Li, Qiannan Hu, Guangzhong Wang, and
Guoqing Zhang as independent PIs. All
of these appointments have strengthened
our presence in computational genomics
and big data analysis. At present, the
research backbone staff consists of
department directors, CAS-MPG
Research Group Leaders and Principal
Investigators (PIs), and the total
number of staff and postgraduates is over
230 people.


+ Extensive and innovative utilization
of biomedical big data is the future of
computational biology. PICB has
taken a leading role in promoting
the construction of a centralized national-level
biomedical big data infrastructure.
The institute has invested in recruiting
several new faculty members in this area
and allocated an annual running budget
of 3 million RMB for the first two years of
operation of the Bio-Med big data center in
PICB/SIBS. The plan is for the big data
center to obtain external funds for future
development and to serve as a core unit in the
operation of the future national infrastructure
of Bio-Med big data. In 2017, the big data
center of PICB/SIBS secured ~50 million
RMB in funding from the Guizhou local government
to set up a satellite data storage/
management site in Gui’an, and is in the
process of securing another 100 million RMB
from the Shanghai municipality for setting
up the Shanghai site. The construction of the
big data center will put PICB in a favorable
position to lead the computational biology
field in the big data revolution. Through this
center, the institute seeks to become a
critical hub for Bio-Medical data collection,
standardization, distribution and analysis.


+ PICB has continued to improve its
high-performance computing system and
high-throughput quantitative biology platform
to provide our researchers with high-throughput
biological data acquisition and
powerful processing facilities. The floating-point
performance of the computer
cluster now reaches 130 TFLOPS (trillion
floating-point operations per second); the network
storage (NAS) capacity is more than 2.3 PB and the
network transfer rate is up to 400 Mbit/s.
Similarly, the Omics core lab has by now
processed over 1300 samples, fulfilled
service contracts worth 4.2 million CNY, and
generated close to 3000 gigabases of
high-quality data.


+ In terms of external cooperation, PICB
has established more extensive and active
academic exchanges and ongoing
cooperation with domestic and foreign
institutes, which continue to enhance
its prestige and influence at home and
abroad. In the past three years, PICB has
organized 4 international conferences,
received 161 visiting scholars, and sent 204
people abroad for academic conferences
and cooperative research visits.


+ In terms of research output, PICB has
produced, in the past three years, a total
of 104 high-level academic papers, all
of which have been published, with
an average impact factor of 7.26, in
line with the impact factor of the top
Computational Biology and Bioinformatics
journals (e.g. PLoS Computational Biology/
Bioinformatics).


+ While CAS and MPG have increased their
investment, doubling the institute's regular
funds, over the past three years, PICB’s
external funding has now reached almost 200
million RMB.


1.6 Future Prospects 


With the new challenge and opportunity
of expanding biological big data, PICB has
taken a unique position to become a leading
institute in this exciting area and to serve as
a national and international hub for new
technologies in data archiving, integration
and analysis. Significant efforts must continue
to maintain PICB as a highly respected and
influential international research institute
in the field of computational biology in the big
data era. The institute has secured sizable
new funds (equivalent to ~30 million USD) to
establish the Bio-Med big data center, which
will initiate excellent international research
platforms for data-driven biological science
and help attract outstanding scientists in
Computational Biology.


As a successful model for international
cooperation in science and technology,
PICB is especially important for promoting
interdisciplinary research and for enhancing
the exchange of both innovative ideas and
scientific data on an unprecedented scale. For
PICB to achieve its development goals and
become a scientific beacon in the coming
years, a number of tasks should be completed
in the near future:


1. Increase efforts to recruit excellent
talent from all countries to raise the
scale and quality of the research teams,
especially in the emerging field of data
science. The researchers attracted
by PICB will fall within our research
breakthrough and three cultivated
directions, including engineering
talent in important emerging areas
such as epigenomics, medical genomics,
and dynamical systems modeling. The
support of MPG in recruiting an additional
institute director in this area and an IRG
will be critical for the future success of this
endeavor.


2. Gain continuous support from both CAS
and MPG through extension of their
collaboration contract. The newly recruited
talent will also fit the new research goal of
SIBS, which has placed more emphasis on
the health of an aging population, such as
chronic and age-related diseases.


3. Foster more intense collaborative efforts
with other Max Planck institutes
and within the Shanghai Institutes for
Biological Sciences, as well as with other
research institutes in China and abroad.
PICB will organize important international
research conferences on a regular basis, and
attract world-renowned scientists as guest
investigators to promote new technology
development. To take full advantage of being an
international cooperative institute, PICB
will establish guest groups consisting of foreign
scientists to facilitate key breakthroughs in
selected areas.


4. Improve the mentoring and career
development program for students,
postdocs and staff, including grant clinics.


5. Improve intra-PICB collaborative efforts
and synergy through regular themed
internal seminar programs.


Implementing the above tasks will be key
for PICB to achieve greater success in the
forthcoming years.


1.7 Organizational Chart


Figure 4: Overall organization of PICB
[Organizational chart; recoverable labels: Department Directors (Zefeng Wang); Independent Research Group Leader (Xinguang Zhu)]


DIRECTORS’ RESEARCH GROUPS 


2 Directors’ Research Groups 


2.1 RNA System Biology 


Researchers:

Dr. Zefeng Wang (Principal Investigator)
Phone: +86-21-54920416
Email: wangzefeng@picb.ac.cn


Current members:

Graduate Student:
Qianyun Lu, Jiefu Li, Sirui Zhang

Research Staff:
Yun Jiang, Huanhuan Wei, Wenjian Han, Xiaoyan Feng

Visiting Scholar:
Xiaowei Song, Miaowei Mao

Visiting student:
Jiaqiang Zhang, Yun Yang, Yue Hu, Chuyun Chen


Past members:

Zhaoyuan Fang, Research Staff, 2016-2017, now at Bio-Med Big Data Center
Xuerong Yang, Post-doc, 2015-2017, now Associate Professor at Shandong Agricultural University
Jieyun Yin, Technician, 2015-2016, now a student at UCSD
Yuanlong Liu, Post-doc, 2015-2016, now at UC Irvine


Research Overview


The main focus of my lab is to study gene
regulation at the RNA level using systems biology
approaches. Our aim is to gain a better
understanding of RNA biology at the genomic
scale, and to use this new knowledge to
improve human health. We are currently
working on three related areas. First, we
systematically study the regulation of
alternative splicing and its implications in
human cancers, and explore the possibility
of targeting splicing regulation as a new
anti-cancer therapy. Second, we study the splicing
regulation and biogenesis of long non-coding
RNAs, especially the biogenesis and function of
circular RNAs. Finally, we are developing new
approaches to specifically manipulate RNA
metabolism using engineered protein factors
that enable “transcriptome editing” and have
therapeutic value for RNA-related human
diseases.


I. Splicing regulation in normal
tissues and cancers.


More than 90% of all human genes undergo
alternative splicing to produce multiple
isoforms with distinct activities. This process
is tightly regulated in vivo, and the
mis-regulation of splicing is a common cause of
human diseases. In particular, splicing
mis-regulation is one of the molecular hallmarks
of human cancer, with hundreds of genes
shifting their splicing toward cancer-specific
isoforms in tumors. The genes with
cancer-associated splicing isoforms can affect
key cellular pathways, including cell cycle
and DNA repair. My lab focuses
on the general rules of splicing
regulation. We have systematically identified
splicing regulatory cis-elements and cognate
trans-acting splicing factors, and study their
interactions in the cellular contexts of normal
and cancer cells. We seek to identify key
splicing events that can be used as molecular
markers or therapeutic targets for human
cancers.
Current State of Research


Splicing regulation and cancer


1.1 Systematic identification of cancer
specific splicing events.

Dysregulation of alternative splicing (AS) is
one of the molecular hallmarks of cancer, with
the splicing of numerous genes altered in
cancer patients. However, studying splicing
mis-regulation in cancer is complicated by
the substantial background of tissue-specific
splicing. To obtain a global picture of
cancer-specific splicing, we analyzed transcriptome
sequencing data from 1149 patients in the TCGA
project, producing a core set of AS events
significantly altered across multiple cancer
types. These cancer-specific AS events are
highly conserved, more likely to maintain
the protein reading frame, and mainly function
in the cell cycle, cell adhesion/migration, and
insulin signaling pathways. Furthermore,
these events can serve as new molecular
biomarkers to distinguish cancer from normal
tissues, to separate cancer subtypes, and to
predict patient survival. Surprisingly, we also
found that most genes whose expression
is associated with cancer-specific splicing
are key regulators of the cell cycle (Figure
1), providing mechanistic insight into how
splicing is mis-regulated in cancers. This work
was published in Oncotarget.
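
To make the notion of a cancer-specific AS event more concrete, the sketch below computes percent-spliced-in (PSI) values from read counts and compares tumor with normal samples. It is an illustration only: the function names, the read counts, and the 10-point cutoff on the PSI difference are assumptions made for this example and do not describe the pipeline used in the study.

```python
from statistics import mean

def psi(inclusion_reads: int, exclusion_reads: int) -> float:
    """Percent spliced in: share of reads supporting exon inclusion (0-100)."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("no reads cover this splicing event")
    return 100.0 * inclusion_reads / total

def delta_psi(tumor_counts, normal_counts) -> float:
    """Mean PSI in tumor samples minus mean PSI in normal samples."""
    tumor = mean(psi(i, e) for i, e in tumor_counts)
    normal = mean(psi(i, e) for i, e in normal_counts)
    return tumor - normal

# Hypothetical (inclusion, exclusion) read counts for one exon-skipping event.
tumor_counts = [(20, 80), (30, 70), (25, 75)]
normal_counts = [(70, 30), (80, 20), (75, 25)]

shift = delta_psi(tumor_counts, normal_counts)
is_candidate = abs(shift) >= 10.0  # assumed cutoff, for illustration only
print(f"delta PSI = {shift:.1f}, candidate event: {is_candidate}")
```

A real analysis would additionally need a statistical test across samples and a correction for tissue-specific splicing, which this sketch leaves out.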
Figure 1. Interacting network for genes whose expression is associated with cancer-specific splicing. Color-coded
proteins were clustered by MCODE, and the most enriched function/GO term is labeled next to each cluster.


We are using the CRISPR-Cas method to
systematically carry out isoform-specific knockout,
and then examine how the different splicing
isoforms of these genes differentially
affect cancer cell proliferation and migration.


1.2 A high-resolution transcriptome map 
of cell cycle and periodic splicing. 


The mis-regulation of cell division is one of
the most prominent features of cancer cells.
Cell division is largely orchestrated by periodic
gene regulation, conventionally thought to
be mediated through transcription. Although
most genes undergo alternative splicing, the role
of alternative splicing in cell cycle regulation
has not been thoroughly investigated. We
conducted deep sequencing of the
human transcriptome through two continuous
cell cycles, revealing periodic dynamics of
>1,000 coding and non-coding RNAs. Our set
of periodic genes was used to develop a new
computational approach termed “mitotic trait”,
which can classify primary tumors and normal
tissues by their transcriptome similarity to
different cell cycle stages. By analyzing >4,000
tumor samples, we found that the mitotic trait
significantly correlates with genetic alterations,
tumor subtype and patient survival. We
further found that periodic genes have unique
chromatin features, including increased levels
of CTCF/RAD21 and H3K36me3. This work provides
new insights into chromatin-mediated control of
periodic gene regulation and offers a powerful
predictor of cancer patient outcomes.
Figure 2. Global detection of periodic cell cycle-dependent alternative splicing. (A) Heat map
representation of periodically spliced events. Row-normalized relative PSI values are shown. The diagram below
indicates cell cycle phase. (B) Overlap between periodically spliced genes and periodically expressed genes detected
by RNA-Seq. (C) Heat map representation of enriched Gene Ontology terms shown as log(P-value). (D) Real-time
quantitative PCR analysis of periodic retained introns and total mRNAs for three selected genes. Cells were
synchronized by double thymidine block and samples were collected 0, 3, 6, 9, 12 and 15 hours post release. (E)
Schematic representation of the AURKB AS pattern. The line graph shows the relationship between intron retention and
mRNA levels for the AURKB gene across the cell cycle. Percent intron retention (solid red line) across the cell cycle was
used to determine the fraction of total mRNAs (solid blue line) not containing an intron, i.e. ‘corrected’ mRNA levels
(dashed blue line).
In addition, sequencing the human
transcriptome through two continuous cell
cycles enabled us to identify ~1,300 genes with
cell cycle-dependent AS changes. These genes
are significantly enriched in functions linked to
cell cycle control, yet they do not significantly
overlap with genes subject to periodic changes
in steady-state transcript levels (Figure 2).
Mechanistically, we found that Cdc2-like
kinase 1 undergoes periodic fluctuations
through an auto-inhibitory circuit to control
a network of periodic splicing events that are
required for the cell cycle. Our findings elucidate a
novel mechanism for periodic gene regulation
independent of transcription, suggesting
that proteome expansion via splicing adds
a new regulatory layer for the control of gene
function during cell division. These results
were published as two papers in eLife and Cell
Research.
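
The ‘corrected’ mRNA level described in the caption of Figure 2E follows from simple arithmetic: the total mRNA level multiplied by the fraction of transcripts without a retained intron. A minimal sketch is given below; the function name and all numbers are hypothetical, and only the formula is taken from the caption.

```python
def corrected_mrna(total_mrna: float, percent_intron_retention: float) -> float:
    """Level of intron-free mRNA given total mRNA and percent intron retention."""
    if not 0.0 <= percent_intron_retention <= 100.0:
        raise ValueError("percent intron retention must lie between 0 and 100")
    return total_mrna * (1.0 - percent_intron_retention / 100.0)

# Hypothetical time course: (hours after release, total mRNA, % intron retention).
time_course = [(0, 1.0, 40.0), (3, 1.2, 25.0), (6, 2.0, 10.0)]

for hours, total, retention in time_course:
    print(hours, round(corrected_mrna(total, retention), 3))
```

The correction matters most when intron retention changes across the cell cycle: the same total mRNA level can then correspond to quite different amounts of translatable transcript.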


1.3. Splicing control in Hippo-YAP pathway 
provides a new anti-cancer target. 


Splicing deregulation occurs extensively in
cancers, yet the biological consequences of
such alterations are mostly undefined. We
found that Hippo-YAP signaling, a key
pathway that regulates cell proliferation
and organ size, is under the control of a splicing
switch. We demonstrated that TEAD4, the
transcription factor that mediates Hippo-YAP
signaling, undergoes alternative splicing
facilitated by the tumor suppressor RBM4,
producing a truncated isoform, TEAD4-S,
which lacks the N-terminal DNA-binding domain
but retains the YAP-interaction domain.
TEAD4-S is located in both the nucleus and the
cytoplasm, acting as a dominant negative
isoform against YAP activity (Figure 3). Consistently,
TEAD4-S is reduced in cancer cells, and its
re-expression suppresses cancer cell proliferation
and migration, inhibiting tumor growth in
xenograft mouse models. Furthermore,
TEAD4-S is reduced in human cancers, and
patients with elevated TEAD4-S levels have
improved survival. Altogether, these data
reveal a splicing switch that serves to fine-tune
the Hippo-YAP pathway, which opens new
doors to targeting major cellular signaling
pathways through splicing manipulation. This work was
published in Nature Communications, and
was highlighted by the CAS newsroom.


[Figure 3 diagram; recoverable labels: Extracellular signals, RBM4, YAP/TEAD, TEAD4 binding site, Targets, Promote transcription]


Figure 3. A schematic model of how the Hippo-YAP
pathway is regulated through a splicing switch.
The activity of the Hippo-YAP pathway is mainly controlled through
protein phosphorylation and degradation. Under normal
conditions, YAP is translocated into the nucleus and recruited
to DNA by TEAD4 to stimulate cell proliferation. TEAD4 is
controlled by RBM4 to produce a truncated splicing isoform,
TEAD4-S, which lacks the N-terminal DNA-binding domain
but contains the YAP-interaction domain. Therefore, TEAD4-S
suppresses the translocation of YAP and acts as a dominant
negative isoform against YAP. Splicing that produces TEAD4-S can inhibit
cancer progression in vivo. Consistently, TEAD4-S is reduced
in human cancers, which might be explored as a new
anti-cancer strategy.


We have further studied the splicing
regulation of YAP, another key component
of the Hippo-YAP pathway. We found that
different YAP isoforms have distinct activities
in promoting transcription of downstream
genes, and that disruption of this splicing balance
affects cancer growth. A manuscript on
this work is in preparation.


Future Perspective 


To study splicing regulation in cancer versus normal cells, we will mainly
apply a systematic approach, using the large datasets generated by the TCGA
(The Cancer Genome Atlas) and ENCODE projects (such as mRNA-seq and CLIP-seq
data) to infer the general rules of splicing regulation in cancers. We will
use gene association analyses to predict how different alternative splicing
events are controlled by the levels of various splicing factors in different
cancer samples. Such associations will be further tested in cultured cells
using over-expression or knockdown of specific splicing factors. We aim to
identify the splicing factors that play key roles in maintaining
cancer-specific splicing patterns; such factors will be new candidate
targets for tumor therapies.
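
As a minimal sketch of the kind of association analysis described above, the
following Python snippet ranks splicing factors by the Spearman correlation
between their expression and the percent-spliced-in (PSI) value of one
alternative splicing event across samples. The numbers, the choice of
Spearman correlation and all function names are illustrative assumptions,
not the pipeline actually used by the group.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rank_factors(psi, factor_expression):
    """Sort factors by the absolute correlation of expression with PSI."""
    scores = {name: spearman(expr, psi)
              for name, expr in factor_expression.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical toy data: PSI of one exon and factor expression in 6 samples.
psi = [0.10, 0.25, 0.40, 0.55, 0.70, 0.90]
factor_expression = {
    "RBM4": [9.0, 8.1, 6.5, 5.2, 4.0, 2.3],   # falls as PSI rises
    "SRSF1": [3.0, 3.5, 4.1, 5.0, 6.2, 7.9],  # rises with PSI
    "PTBP1": [5.0, 4.8, 5.3, 4.9, 5.1, 5.0],  # no clear trend
}
ranking = rank_factors(psi, factor_expression)
for name, rho in ranking:
    print(f"{name}: rho = {rho:+.2f}")
```

Factors at the top of such a ranking would be the candidates to test by
over-expression or knockdown in cultured cells.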


II. Splicing, biogenesis and function
of long non-conventional RNAs.


Recent transcriptome-wide studies have revealed a large number of long
non-coding RNAs (lncRNAs) with diverse functions, and the biogenesis and
functions of these lncRNAs have become an emerging field with great impact on
the study of gene expression. Like coding genes, most known lncRNAs are
transcribed by RNA Pol II and undergo post-transcriptional processing (e.g.
splicing and polyadenylation). However, little is known about how the
splicing of lncRNAs is regulated, although there are reports that the
biogenesis of lncRNAs depends on correct splicing. In particular, a
non-conventional splicing process called back-splicing (i.e. a downstream 5’
splice site is joined to an upstream 3’ splice site) can produce a large
number of circular RNAs (circRNAs). We have previously identified several
circular RNAs that may play important roles in cardiovascular diseases, but
the biological functions of most circRNAs are unclear. We are currently using
splicing reporters to study how back-splicing can be regulated in vivo by
cis-regulatory elements and splicing factors. In addition, we have developed
a GFP-based circRNA reporter to study the translation of circRNAs and the
biological function of circRNA-encoded proteins.


Current State of Research 


Splicing and function of non-coding genes 


2.1. Development of a GFP based reporter
to study back splicing of circRNA


While the human transcriptome contains a large number of circRNAs, the
functions of most circRNAs remain unclear. Most circRNAs are generated by
splicing exons in reversed order (back-splicing). However, the mechanisms of
back-splicing are largely unknown. We constructed a single-exon minigene
containing split GFP and found that the pre-mRNA indeed produces circRNA
through efficient back-splicing in human and Drosophila cells. Back-splicing
is enhanced by complementary introns that form double-stranded RNA structures
to bring splice sites into proximity, but such structures are not required.
Moreover, back-splicing is regulated by general splicing factors and
cis-elements, but with regulatory rules distinct from canonical splicing. The
resulting circRNA can be translated to generate functional proteins. Unlike
in linear mRNA, poly-adenosine or poly-thymidine in the 3’ UTR can inhibit
circular mRNA translation. This study revealed that back-splicing can occur
efficiently in diverse eukaryotes to generate translatable circRNAs, and was
published in RNA.
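
To illustrate why a split reporter inside a single exon can report
back-splicing, the short Python sketch below builds the sequence that spans
the back-splice junction and checks that a full open reading frame exists
only on the circular product. The toy sequences, the order of the two ORF
halves and the function names are assumptions made for illustration; they are
not the actual reporter sequences.

```python
def back_splice_junction(exon, flank=5):
    """Sequence spanning the back-splice junction of a single-exon circRNA.

    Back-splicing joins the 3' end of the exon to its own 5' end, so the
    junction reads: last `flank` nt of the exon + first `flank` nt.
    """
    if len(exon) < 2 * flank:
        raise ValueError("exon too short for the requested flank")
    return exon[-flank:] + exon[:flank]


def circular_contains(exon, query):
    """True if `query` occurs anywhere on the circle formed by `exon`."""
    if len(query) > len(exon):
        return False
    doubled = exon + exon[: len(query) - 1]
    return query in doubled


# Hypothetical split-reporter exon: the 3' half of an ORF is placed first and
# the 5' half second, so the ORF is contiguous only across the junction.
orf_5prime_half = "AUGGUGAGCAAG"
orf_3prime_half = "GGCGAGGAGUAA"
exon = orf_3prime_half + orf_5prime_half
full_orf = orf_5prime_half + orf_3prime_half

junction = back_splice_junction(exon, flank=4)
print(junction)                            # CAAGGGCG
print(full_orf in exon)                    # False: absent from the linear exon
print(circular_contains(exon, full_orf))   # True: present on the circle
```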


2.2 Extensive translation of circular RNAs 


Although circRNAs are fairly prevalent in the human transcriptome, the
biological functions of these circRNAs remain largely unclear. Using the
circRNA reporter system developed in our lab, we found that
N6-methyladenosine (m6A), the most abundant base modification of RNA,
efficiently initiates protein translation from circRNAs in human cells. We
discovered that consensus m6A motifs are enriched in circRNAs and that a
single m6A site is sufficient to drive translation initiation. This
m6A-driven translation requires the initiation factor eIF4G2 and the m6A
reader protein YTHDF3, is enhanced by the methyltransferase METTL3/14, and is
inhibited by the demethylase FTO (Figure 4). We found that circRNA
translation is upregulated upon heat shock; however, the detailed mechanism
is unknown. Further analyses through polysome profiling, computational
prediction and mass spectrometry revealed that m6A-driven translation of
circRNAs is widespread, with hundreds of endogenous circRNAs having
translation potential. Our study expands the coding landscape of the human
transcriptome and suggests a role for circRNA-derived proteins in cellular
responses to environmental stress. This work substantially changes the
recognized scope of lncRNA function and was published in Cell Research.


Figure 4. A schematic diagram of circRNA translation driven by m6A. The m6A
site in a circRNA is recognized by the reader protein YTHDF3, which binds the
translation initiation factor eIF4G2, a scaffold protein that facilitates
ribosome assembly for protein translation. A large number of circRNAs can
direct protein synthesis through this mechanism.
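
The computational prediction step mentioned above can be pictured with a
small motif scan. The sketch below searches a circular sequence for the
commonly cited RRACH consensus (R = A/G, H = A/C/U), including motifs that
span the back-splice junction. The report does not state which motif
definition was used, so the motif, the example sequences and the function
name are assumptions.

```python
import re

# Commonly cited m6A consensus RRACH; the methylated base is the central A.
# Using this particular motif definition is an assumption of this sketch.
M6A_MOTIF = re.compile(r"(?=([AG][AG]AC[ACU]))")
MOTIF_LENGTH = 5


def m6a_sites_circular(seq):
    """Return 0-based positions of candidate methylated A's on a circRNA.

    The sequence is extended by MOTIF_LENGTH - 1 bases so that motifs
    spanning the back-splice junction are found as well.
    """
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    if n < MOTIF_LENGTH:
        return []
    extended = seq + seq[: MOTIF_LENGTH - 1]
    sites = set()
    for match in M6A_MOTIF.finditer(extended):
        start = match.start()
        if start < n:                    # count each circular start once
            sites.add((start + 2) % n)   # index of the central A
    return sorted(sites)


# Hypothetical circRNA sequences.
print(m6a_sites_circular("UUGGACUUUUUU"))  # [4]  motif inside the sequence
print(m6a_sites_circular("ACUUUUUUUUGG"))  # [0]  motif spans the junction
```

A linear scan would miss the second site, because the motif GGACU only exists
once the 3' end of the exon is joined to its 5' end.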


Future Perspective 


We will further study how splicing factors or other RNA-binding proteins
regulate lncRNA biogenesis, and determine whether the splicing of lncRNAs is
controlled in a similar fashion to that of mRNAs. We will focus on
fundamental questions including: (1) Are lncRNAs (especially nuclear
lncRNAs) spliced by the same mechanism as mRNAs? (2) How prevalent is
alternative splicing in lncRNAs? (3) What are the consequences of
alternatively spliced lncRNAs? (4) Do splicing factors and cis-elements
control the alternative splicing of lncRNAs, and do the same regulatory rules
apply? We will draw on our experience in developing splicing reporters and
identifying RNA-protein interactions, and integrate computational methods
with experimental approaches for these studies.


Figure 5. Engineering artificial factors to specifically manipulate RNA
metabolism. (A) Modular binding between the PUF domain and RNA targets. (B)
Combining the PUF domain with various effector domains generates artificial
PUF factors that can manipulate different stages of RNA metabolism.


III. Engineering artificial factors to
manipulate RNA metabolism.


The regulation of RNA metabolism plays a central role in controlling gene
expression, and dysregulation of RNA processing is closely associated with
human diseases. Therefore, manipulating RNAs with bioengineering approaches
can provide a powerful tool to facilitate the study of gene regulation and
the development of new therapeutic approaches. My lab is one of the pioneers
in the field of engineering artificial RNA-binding factors. Using a synthetic
biology approach, we have designed a class of artificial proteins that
contain a programmable RNA recognition module (PUF domain) and a functional
module (Figure 5). The PUF domain recognizes RNA targets in a predictable
fashion: it contains eight tandem repeats, each of which recognizes a single
base in a consecutive RNA fragment. Such interactions resemble the typical
Watson-Crick base pairs found in dsRNA, enabling tight PUF-RNA binding whose
specificity can be reprogrammed. We have used these artificial factors to
specifically manipulate RNAs at the steps of splicing and RNA degradation,
and to explore the therapeutic applications of this new “transcriptome
editing” approach.
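
The one-repeat-per-base logic can be sketched as a simple lookup. The snippet
below maps an 8-nt target to the side chains that each repeat would need,
using the two-residue code commonly reported for PUF repeats (positions 12
and 16) and the antiparallel binding orientation. The report does not list
the code, so the table, the example target and the function name are stated
assumptions, and real designs also depend on stacking residues and on the
protein scaffold.

```python
# Side chains at positions 12 and 16 of a PUF repeat that are commonly
# reported to specify each base. Assumed here for illustration only.
PUF_CODE = {
    "A": ("Cys", "Gln"),
    "U": ("Asn", "Gln"),
    "G": ("Ser", "Glu"),
    "C": ("Ser", "Arg"),
}


def design_puf(target_rna):
    """Map an RNA target (written 5'->3') to per-repeat residue choices.

    PUF domains bind RNA antiparallel: with n repeats, repeat n contacts the
    first (5') base and repeat 1 contacts the last (3') base.
    """
    target = target_rna.upper().replace("T", "U")
    unknown = set(target) - set(PUF_CODE)
    if unknown:
        raise ValueError(f"unsupported bases: {sorted(unknown)}")
    n = len(target)
    design = [(n - index, base, PUF_CODE[base])
              for index, base in enumerate(target)]
    return sorted(design)  # ordered by repeat number


# Hypothetical 8-nt target.
for repeat, base, (pos12, pos16) in design_puf("UGUAUAUA"):
    print(f"repeat {repeat}: binds {base} via {pos12}12/{pos16}16")
```

Because the function only depends on the target length, the same lookup
extends to designs with more or fewer than eight repeats.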


Current State of Research 


Specific manipulation of RNA with artificial 
proteins 


Based on the modular design, we have previously engineered artificial
proteins by combining an RNA-binding module (PUF domain) with a functional
module. Since 2015, we have been actively working on optimizing the
RNA-binding code of the PUF domain. The endogenous PUF domain has eight
repeats that recognize an 8-nt RNA target, and we have re-engineered PUF
domains to recognize RNA sequences of different lengths in a modular fashion.
Unexpectedly, we found that increasing the number of PUF repeats does not
always improve the binding affinity. Structural analysis of a PUF with nine
repeats (PUF-9R), alone and in complex with its cognate RNA, suggests that
PUF-9R recognizes the RNA in a modular fashion and that RNA binding alters
the PUF curvature, indicating that conformational changes of the PUF can
affect binding affinity. In addition, we used structure-based protein
modeling to trim redundant sequences in the PUF domain, which improved its
binding affinity to RNA. Our results indicate that varying the number of
repeats in engineered PUFs will be very useful in reducing the off-target
effects of these factors. A manuscript describing this work is in
preparation.


Future Perspective 


We will expand our work on engineered RNA-binding factors in three main
directions: (1) We will develop additional artificial factors with different
activities as novel research tools (e.g., for manipulating RNA editing or
polyadenylation); (2) We will refine the RNA-binding code of PUF domains and
produce designer PUFs with a stepwise assembly pipeline rather than by
mutagenesis; (3) We will use the artificial factors to target the toxic RNA
repeats in neurodegenerative diseases (such as myotonic dystrophy,
spinocerebellar ataxia and Huntington's disease).


Research Overview


The Department of Molecular Systems Biology has been headed by Dr. Jing-Dong
Jackie Han since she joined PICB in 2010. The research of our group is
focused on three areas: 1) Systems biology of aging; 2) Transcriptome dynamic
change during embryonic development; 3) Computational algorithm development
for data integration and network analysis. Using data mining, statistical
approaches and network theories, we first generate biological hypotheses and
computational models, and then use molecular biology, cell biology and
systems biology approaches to validate and refine these models and
hypotheses.


I. Systems biology of 
aging 


Current State of Research 


Aging is an irreversible and complex biological process faced by all
organisms. It is characterized by the gradual and progressive decline of
numerous physiological functions and of homeostasis, leading to many diseases
and, ultimately, death. Despite the ubiquity and importance of aging, its
molecular basis remains incompletely understood.


Figure 1. Visualization of facial aging. (A) The female and male average
profiles of five age groups from 17 to 77 years old. N indicates the number
of subjects in each age group. (B-D) Synthesized female and male average
profiles between -2 and +2 SD of loading values of age-correlated PLS
component 1 (B), component 2 (C) and combined components 1 and 2 (D). (E-G)
Heat map of 3D effects displaying loading values of age-correlated PLS
component 1 (E), component 2 (F) and combined components 1 and 2 (G) shown on
female and male faces. The loading values were multiplied by 10 000. Red and
blue denote, respectively, higher and lower values along x-, y- and z-axes.


Figure 2. Aging-related facial morphological phenotypes. (A) The 17 landmarks
used to align all faces. (B) Clustering of all quantified facial features,
blood serum indicators, blood cell indicators, body indexes and PLS
components in females and males. (C) The correlation network of female facial
features and PLS components. (D) The network of male facial features and PLS
components. Node size is proportional to the correlation between the feature or compon


Reliable prediction of the aging process is important for assessing the risks
of aging-associated diseases. However, despite intense research, there is so
far no reliable aging marker. Recently, we addressed this problem by
examining whether human 3D facial imaging features could be used as reliable
aging markers. We collected > 300 3D human facial images and blood profiles
well distributed across ages of 17 to 77 years. By analyzing the
morphological profiles, we generated the first comprehensive map of the aging
human facial phenome. We identified quantitative facial features, such as eye
slopes, that are highly associated with age. We constructed a robust age
predictor and found that, on average, people of the same chronological age
differ by ± 6 years in facial age, with the deviations increasing after age
40. Using this predictor, we identified slow and fast agers whose status is
significantly supported by levels of health indicators. Despite a close
relationship between facial morphological features and health indicators in
the blood, facial features are more reliable aging biomarkers than blood
profiles and can better reflect general health status than chronological age.
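
The idea of a facial age predictor and of slow and fast agers can be sketched
in a few lines. The example below fits a one-feature least-squares line in
place of the PLS model mentioned in the figure captions, predicts a "facial
age" for each subject, and labels subjects whose predicted age deviates from
their chronological age by more than a threshold. The single feature, the toy
numbers, the 6-year threshold and the function names are all assumptions made
for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


def age_gaps(feature, age):
    """Predicted facial age minus chronological age for each subject."""
    slope, intercept = fit_line(feature, age)
    return [slope * f + intercept - a for f, a in zip(feature, age)]


def classify(gaps, threshold=6.0):
    """Label subjects whose face looks older or younger than their age."""
    labels = []
    for gap in gaps:
        if gap > threshold:
            labels.append("fast ager")
        elif gap < -threshold:
            labels.append("slow ager")
        else:
            labels.append("typical")
    return labels


# Hypothetical data: one facial feature (arbitrary units) and ages in years.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 3.0]
age = [20.0, 30.0, 40.0, 50.0, 60.0, 30.0, 50.0]

gaps = age_gaps(feature, age)
labels = classify(gaps)
for a, g, label in zip(age, gaps, labels):
    print(f"age {a:.0f}: gap {g:+.1f} years -> {label}")
```

In this toy set the sixth subject is 30 years old but has the face of an
average 40-year-old, and is therefore labelled a fast ager.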


Dietary restriction (DR) is the most powerful natural means to extend
lifespan. Although several genes can mediate responses to alternate DR
regimens, no single genetic intervention has recapitulated the full effects
of DR, and no unified system is known for different DR regimens. Here we
obtain temporally resolved transcriptomes during calorie restriction and
intermittent fasting in Caenorhabditis elegans and find that early and late
responses involve metabolism and cell cycle/DNA damage, respectively. We
uncover three network modules of DR regulators by their target specificity.
By genetic manipulations of nodes representing discrete modules, we induce
transcriptomes that progressively resemble DR as multiple nodes are
perturbed. Targeting all three nodes simultaneously results in extremely
long-lived animals that are refractory to DR. These results and dynamic
simulations demonstrate that extensive feedback controls among regulators may
be leveraged to drive the regulatory circuitry to a younger steady state,
recapitulating the full effect of DR.


[Figure 3 schematic; legible labels: temporal expression, perturbation
microarrays, TF ChIP-seq, calorie restriction/intermittent fasting, insulin,
super long-lived, refractory to CR]

Figure 3. A Systems Approach to Reverse Engineer Lifespan Extension by
Dietary Restriction. We obtain temporally resolved effects of diet
restriction on aging transcriptomes. Early responses involve metabolism; late
involve cell cycle and DNA damage. We find three regulator groups with novel
regulators separated by target specificity. Regulator feedbacks are leveraged
to fully recapitulate diet restriction effects.


The RNA-binding protein LIN-28 was first found to control developmental
timing in Caenorhabditis elegans. Later, it was found to play important roles
in pluripotency, metabolism, and cancer in mammals. Here we report that a low
dosage of lin-28 enhanced stress tolerance and longevity, and reduced
germline stem/progenitor cell number in C. elegans. The germline
LIN-28-regulated microRNA let-7 was required for these effects by targeting
akt-1/2 and decreasing their protein levels. AKT-1/2 and the downstream
DAF-16 transcription factor were both required for the lifespan and germline
stem cell effects of lin-28. The pathway also mediated dietary restriction
induced lifespan extension and reduction in germline stem cell number. Thus,
the LIN-28/let-7/AKT/DAF-16 axis we delineated here is a program that plays
an important role in balancing reproduction and somatic maintenance and their
response to the environmental energy level, a central dogma of the
‘evolutionary optimization’ of resource allocation that modulates aging.


Figure 4. LIN-28 requires DAF-16 to influence germline stem cell number and
longevity. (A) RNAi of lin-28 does not extend the lifespan of the
daf-16(mu86) mutant (log-rank test P = 0.80). (B) RNAi of lin-28 increases
daf-16::gfp(zIs356) translocation at early L3 stage. DAF-16 translocation
status is quantified manually in three categories, as represented by the
exemplary pictures, with three biological replicates. (C) The mRNA levels of
the classical DAF-16 target genes sod3, mtl-1, hil-1, and lipl-4 increase
upon lin-28 RNAi. Data represent mean ± SD (n = 3 biological repeats).
*P < 0.05; **P < 0.0001. (D) Representative DAPI-stained images and average
number of proliferative zone nuclei in wild-type and lin-28(n719) worms under
control and daf-16 RNAi treatment. **P < 0.01 by two-tailed Student's t-test.
NS, not significant. (E) RNAi of lin-28 reduces proliferative germ cell
number of wild-type worms significantly but does not change proliferative
germ cell number of daf-16(mu86) mutant worms. *P < 0.05 by two-tailed
Student's t-test. NS, not significant. (F) Proposed model of germline LIN-28
sensing the environmental energy level and acting through its downstream
effectors let-7, AKTs, and DAF-16 to regulate reproduction and longevity by
increasing the number of germline stem cells and repressing somatic
maintenance.


Figure 5. A model for the opposing regulatory effects of lifespan-extending
interventions and aging on ncRNA expression, transposon elements and mRNA.
Within ncRNAs, many miRNAs repress chromatin modification genes; in
particular, three such miRNAs repress Chd1, which further regulates lifespan-
and aging-related genes in two opposite ways: it represses transposons
(potentially as enhancers) and lifespan anti-correlated genes by enhancer
binding, and activates lifespan positively correlated genes by promoter
binding. LS, lifespan.


Lifestyle interventions, such as modulating 
dietary macronutrients, caloric intake, and 
energy expenditure, can considerably 
affect the susceptibility to aging-related 
diseases and, in some cases, an organism's 


lifespan. However, little is known about the 
mechanisms regulating the transcriptional 
program for longevity across multiple 
interventions, especially at the epigenetic 
level. 


With our model of dietary/lifestyle interventions that modify aging-related
phenotypes, liver physiology, and mean/maximum lifespan, we investigated
midlife liver transcriptome changes. Specifically, we profiled the expression
of both messenger RNAs (mRNAs) and ncRNAs, including miRNAs, long ncRNAs
(lncRNAs), and transposable elements, by high-throughput deep sequencing.
Strikingly, three dietary intervention network design patterns were
uncovered: 1) lifespan-extending interventions largely repressed the
expression of miRNAs, lncRNAs, and transposable elements; 2) protein-coding
mRNAs whose expression is positively correlated with long lifespan are highly
targeted by miRNAs; 3) miRNA-targeting interactions mainly target
chromatin-related functions. We experimentally validated the targeting of the
chromatin remodeler Chd1 by miR-34a, miR-107, and miR-212-3p, and further
demonstrated that Chd1 knockdown mimics the gene expression changes and
activation of transposons induced by high-fat diet and aging.


Together, these findings reveal a dramatic global repression of transposons
by lifespan-extending interventions, safeguarding chromatin from leaky
transcription and deregulation of gene expression, at least in part, through
novel miRNA-chromatin remodeler interactions.


Future Perspective


Epigenetic modifications play important roles in the control of
transcriptional and cellular states during development. Epigenetic
modulations are also a key interface for receiving and recording cues from
the environment and potentially mediate the environmental influence over the
rate of aging, allowing heritable and conditionally programmed gene
expression from the genome. However, there has so far been no systematic
analysis of aging-dependent epigenetic changes. To fill this void, we will
measure epigenomic and transcriptomic changes during human brain aging. Then,
using these data, we will computationally infer and experimentally validate
the underlying aging epigenetic regulatory network. In addition, based on the
extensive lifestyle, physiological and pathological data of the participants,
we will also examine the age-dependent epigenetic changes that might be
associated with or influenced by lifestyle and other conditions.


We will also perform similar analyses of the C. elegans and mouse aging
processes under normal, caloric restriction and intermittent feeding
conditions, and use C. elegans lifespan assays to validate the predicted
regulators and regulatory interactions for lifespan control.


II. Transcriptome dynamic
change during embryonic
development


Current State of Research


Pre-implantation embryogenesis encompasses several critical events, including
genome reprogramming, zygotic genome activation (ZGA), and cell fate
commitment, most of which remain mechanistically unclear in primates. In
addition, primates display a high rate of embryo wastage with an unclear
molecular basis. Understanding which factors are involved in genome
reprogramming and ZGA would benefit the efficient generation of induced
pluripotent stem cells. Moreover, discovering the molecular basis of embryo
wastage in primates would greatly expand our knowledge of species evolution.
Here we carried out time-series RNA-seq in single and pooled rhesus monkey
oocytes and pre-implantation embryos encompassing representative
developmental stages. By comparing with human and mouse data, we found that
the transcriptome dynamics of monkey oocytes and embryos were very similar to
those of human, but very different from those of mouse. We identified several
classes of maternal and zygotic genes whose expression peaks were highly
correlated with the time frames of genome reprogramming, ZGA and cell fate
commitment, respectively. Importantly, comparison of the ZGA-related network
modules among the three species revealed a looser surveillance of genomic
instability in primate oocytes and embryos than in rodents, in particular in
the pathways of DNA damage signaling and homology-directed repair of DNA
double-strand breaks.


Figure 6. Comparison of six groups of DNA damage response and HR-mediated
repair genes. (A) Distribution of pair-wise Pearson correlation coefficients.
Genes in the BER, DDRC and HR groups were statistically different between
primate and mouse (p-value < 0.05, Kolmogorov-Smirnov test). (B) Mean
log-transformed expression pattern of HR-related genes (EME1/Eme1,
RAD51/Rad51, RAD54L/Rad54l, RECQL/Recql, SHFM1/Shfm1, UBA2/Uba2 and
XRCC2/Xrcc2) in human, monkey and mouse. (C) Examination of gamma H2AFX/H2afx
and RAD51/Rad51 foci in untreated and etoposide-treated GV oocytes recovered
for 0 hr (Etop+0h) and 3 hr (Etop+3h). More gamma H2afx foci were observed in
mouse GV oocytes than in monkey GV oocytes after etoposide treatment and
recovery for different times. Consistently, mouse oocytes accumulated Rad51
on damage sites, whereas few monkey oocytes had RAD51 foci formation. Boxed
regions are enlarged in the enlargement panels. (D) Quantification of
fluorescence foci intensity of gamma H2AFX/H2afx (upper panel) and
RAD51/Rad51 (lower panel). Intensity was normalized by the number of oocytes
examined. Data are represented as mean ± SEM. Scale bar, 10 µm. * p-value <
0.05, ** p-value < 0.001, *** p-value < 0.0001, two-tailed t-test.


Figure 7. Overview of spatial transcriptome analysis in mouse
mid-gastrulation embryo based on Geo-seq. We applied our new method, named
Geo-seq, to mid-gastrulation mouse embryos to collate a spatial transcriptome
resource. 3D quantitative data rendition enables spatial gene expression
pattern visualization in a web-based database and identifies zip code marker
genes for mapping single epiblast cell position in the embryo by gene
expression profile concordance.


Animals have diverse body shapes with cells of distinct morphologies and
functions. All the specific tissues and organs are derived following a
blueprint established by gastrulation during early embryogenesis. The
molecular regulation of gastrulation is conserved among vertebrates. However,
the underlying molecular mechanisms remain unclear, and genome-wide
transcriptome profiles that capture spatial variation are particularly
lacking.


To address these questions, we carried out a comprehensive transcriptome
analysis with high-resolution spatial information retained on a single mouse
mid-gastrulation embryo. A technique named Geo-seq, which combines laser
capture microdissection, low-input RNA-seq technology and efficient
computational analysis strategies, was developed to acquire gene expression
profiles with precise spatial information from the mid-gastrulation embryo.
By mapping the expression of the whole transcriptome back to the embryo, the
3D expression patterns of more than 20,000 individual genes were revealed as
refined digitized whole mount in situ hybridization (d-WISH) images. Based on
the gene expression profiles, the embryo could be spatially delineated into
different expression domains, which are associated with various multipotent
stem cell fates.


The domain-specific markers were also identified to represent the molecular
activities underlying the lineage regionalization of the epiblast.
Importantly, these region-specific markers can be further used as zip codes
for retrospective mapping of single cells or cell lines of unknown position
to the mouse embryo. In addition, a transcription factor regulatory network
was constructed and enrichment analysis of signaling pathways was performed
to help explain the mechanism of lineage determination.
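
The zip-code mapping idea, placing a cell of unknown origin by the
concordance of its marker-gene expression with each reference domain, can be
sketched as follows. The domain names, expression values, the choice of
Pearson correlation as the concordance measure and the function names are
illustrative assumptions and do not reproduce the Geo-seq reference data.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def map_cell(cell_profile, reference, markers):
    """Return (domain, correlation) for the best-matching reference domain."""
    cell_vector = [cell_profile[g] for g in markers]
    best = None
    for domain, profile in reference.items():
        r = pearson(cell_vector, [profile[g] for g in markers])
        if best is None or r > best[1]:
            best = (domain, r)
    return best


# Hypothetical zip-code markers and reference domain profiles.
markers = ["T", "Mixl1", "Sox2", "Otx2"]
reference = {
    "anterior": {"T": 1.0, "Mixl1": 1.0, "Sox2": 9.0, "Otx2": 8.0},
    "posterior": {"T": 9.0, "Mixl1": 8.0, "Sox2": 2.0, "Otx2": 1.0},
}
cell = {"T": 8.0, "Mixl1": 7.0, "Sox2": 2.0, "Otx2": 1.0}

domain, score = map_cell(cell, reference, markers)
print(domain, round(score, 3))
```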


Finally, to benefit the community, a searchable database named iTranscriptome
(www.itranscriptome.org) was built for this work. The expression of every
gene can be searched and viewed by specific patterns. The 3D molecular map
will therefore greatly facilitate the understanding of germ layer
specification during gastrulation. It is also an invaluable resource for
comparative studies of early human embryo development and diseases.


Future Perspective


Recent progress in single-cell RNA-seq enables the delineation of paths
toward cell lineage differentiation during embryo development. In the future,
we plan to use single-cell RNA-seq to explore how lineages are constructed
and which factors are involved in these complex processes.


For mouse embryo development, our recent studies focused only on the epiblast
at the mid-gastrulation stage (E7.0). As a next step, we will add two other
stages (E6.5 and E7.0) and also the mesoderm and endoderm regions to perform
an integrative transcriptome analysis of the whole of gastrulation, such as
identifying the dynamic change patterns of domains and the potential
regulators of different lineages.


III. Computational
Algorithm Development


Current State of Research


Accurate determination of genome-wide nucleosome positioning can provide
important insights into global gene regulation. Here, we describe the
development of an improved nucleosome-positioning algorithm, iNPS, which
achieves significantly better performance than the widely used NPS package.
By determining nucleosome boundaries more precisely and merging or separating
shoulder peaks based on local MNase-seq signals, iNPS can unambiguously
detect 60% more nucleosomes. The detected nucleosomes display better
nucleosome ‘widths’ and neighbouring centre-to-centre distance distributions,
giving rise to sharper patterns and better phasing of average nucleosome
profiles and higher consistency between independent data subsets. In addition
to its unique advantage in classifying nucleosomes by shape to reveal their
different biological properties, iNPS also achieves higher significance and
lower false positive rates than previously published methods. The application
of iNPS to T-cell activation data demonstrates a greater ability to
facilitate detection of nucleosome repositioning, uncovering additional
biological features underlying the activation process.
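
The notion of a nucleosome boundary as an inflection point of a smoothed
signal can be illustrated with a toy peak caller. The sketch below is not the
iNPS algorithm: it simply finds local maxima in an already smoothed track and
walks outwards from each maximum while the discrete second difference stays
negative, taking the last such positions as the two boundaries. The synthetic
Gaussian track, the height threshold and the function names are assumptions.

```python
import math


def second_difference(signal):
    """Discrete second derivative; the two end points are set to 0."""
    d2 = [0.0] * len(signal)
    for i in range(1, len(signal) - 1):
        d2[i] = signal[i + 1] - 2 * signal[i] + signal[i - 1]
    return d2


def call_nucleosomes(signal, min_height=0.1):
    """Call peaks and bound each by the edges of its concave region."""
    d2 = second_difference(signal)
    calls = []
    for i in range(1, len(signal) - 1):
        is_peak = signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
        if not is_peak or signal[i] <= min_height:
            continue
        left = i
        while left - 1 > 0 and d2[left - 1] < 0:
            left -= 1
        right = i
        while right + 1 < len(signal) - 1 and d2[right + 1] < 0:
            right += 1
        calls.append({"centre": i, "left": left, "right": right,
                      "width": right - left})
    return calls


def gaussian_track(centres, sigma, length):
    """Synthetic smoothed signal: a sum of Gaussian bumps."""
    return [sum(math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centres)
            for x in range(length)]


# Hypothetical track with two well-separated nucleosome-like bumps.
track = gaussian_track(centres=[50, 150], sigma=10, length=201)
calls = call_nucleosomes(track)
for call in calls:
    print(call)
```

For a Gaussian bump the inflection points lie one standard deviation from the
centre, so each call in this toy track spans 20 positions.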


Figure 8. Improved nucleosome detection results by iNPS. (a-d) Genomic region
7,734,000-7,739,000 bp (hg18) of chromosome 1 in human resting CD4+ T cells
is shown as an example. (a) Nucleosome detection results by iNPS. (b) The
nucleosome detection profile (wave-form signal within detected nucleosomes)
by iNPS (red line). (c) Robustness of iNPS’s detection results. The SCC
between nucleosome detection profiles derived from two sub-data sets (orange
and green line) is 0.681. (d) Robustness of NPS’s detection results. The SCC
between the nucleosome detection profiles derived from two sub-data sets
(orange and green line) is 0.417. (e) Distribution of nucleosome ‘width’, the
length between two inflection points of each detected nucleosome. (f)
Distribution of the distance between the centre points of two neighbouring
nucleosomes.


With the rapidly increasing availability of high-throughput in situ
hybridization images, how to effectively analyze these images at high
resolution for global patterns and testable hypotheses has become an urgent
challenge. Here we developed a semi-automated image analysis pipeline to
analyze in situ hybridization images of E14.5 mouse embryos at single-cell
resolution for more than 1600 telencephalon-expressed genes from the
Eurexpress database. Using this pipeline, we derived the spatial gene
expression profiles at single-cell resolution across the cortical layers to
gain insight into the key processes occurring during cerebral cortex
development. These profiles displayed high spatial modularity in gene
expression, precisely recapitulated known differentiation zones, and
uncovered additional unknown transition zones or cellular states. In
particular, they revealed a distinctive spatial transition phase dedicated to
chromatin remodeling events during neural differentiation, which can be
validated by genomic clustering patterns, epigenetic modification switches,
and network modules.


Our analysis further revealed a role of mitotic checkpoints during spatial
gene expression state transition. As a novel approach to analyzing, at the
single-cell level, the spatial modularity, dynamic trajectory, and transient
states of gene expression during embryonic neural differentiation and to
inferring regulatory events, our approach will be useful and applicable in
many different systems for understanding dynamic differentiation processes in
vivo and at high resolution.


A Section selection 8 Image curation 


a 


Section 11 (or 14) 


Figure 9. Workflow to extract and digitize expression profiles. (A) The middle-most sagittal section (Section 11 or 14) was selected 
for expression profile measurements. The Eurexpress 3D online mouse embryo model is here used to illustrate the relative position 
of Section 11. (B) A MATLAB graphical interface was used for manually curating three radial lines across the cerebral cortex. Another 
line at the top left of the image was cropped for background correction. (C) Mean intensity values of every nine neighboring pixels 
(shown as dotted 3 x 3 squares) on the cropped line were extracted as a vector of smoothed expression intensities. The region of 
approximately a single cell (marked by the orange ellipse) can be captured with smoothed intensities in six pixels, the average 
length of a bin. (D) Each line was scaled to a 20-bin profile to represent the expression profile of a gene across the radial axis from 
CP to VZ. Then, the profiles for all telencephalon-expressed genes were summarized on a matrix of genes versus their expressions 

in 3 x 20 bins. (E) The representative Eurexpress ISH images with different expression patterns. In the zoomed-in panels (bottom to 
top, VZ to CP), the high-resolution ISH images display the signals and scales for single cells relative to the size of each radial bin as 
indicated by the 20-bin red rulers. Image sources are euxassay_007249_11.jpg, euxassay_009400_11.jpg, euxassay_017942_14.jpg, 
euxassay_003376_11.jpg, euxassay_018949_14.jpg, euxassay_009808_14.jpg, euxassay_009545_11.jpg, euxassay_004815_14.jpg, 
euxassay_019619_11.jpg, euxassay_008979_11.jpg, euxassay_011139_11.jpg, and euxassay_006007_11.jpg, respectively, from top to 
bottom and left to right. 
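
The smoothing and binning steps in panels (C) and (D) can be illustrated with a short Python sketch. This is not the MATLAB code behind the figure: the function names, the nested-list image representation, and the equal-width rebinning rule are assumptions made for illustration, and background correction is left out because the caption does not say how it was applied.

```python
from statistics import mean


def smooth_line(image, line_pixels):
    """Return the mean of the 3 x 3 neighbourhood around each (row, col) on a line.

    `image` is a list of rows of intensities (hypothetical representation).
    Neighbours that fall outside the image are simply ignored.
    """
    n_rows, n_cols = len(image), len(image[0])
    smoothed = []
    for r, c in line_pixels:
        block = [
            image[i][j]
            for i in range(r - 1, r + 2)
            for j in range(c - 1, c + 2)
            if 0 <= i < n_rows and 0 <= j < n_cols
        ]
        smoothed.append(mean(block))
    return smoothed


def rescale_to_bins(values, n_bins=20):
    """Average consecutive stretches of `values` into `n_bins` bins."""
    n = len(values)
    if n < n_bins:
        raise ValueError("need at least one value per bin")
    profile = []
    for b in range(n_bins):
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        profile.append(mean(values[start:end]))
    return profile
```

With a radial line of 120 smoothed pixels, `rescale_to_bins` averages six pixels per bin, which matches the bin length quoted in the caption.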
Pediatric acute lymphoblastic leukemia (ALL) is the most common neoplasm and one
of the primary causes of death in children. Its treatment is highly dependent on the
correct classification of subtype. Previously, we developed a microarray-based subtype
classifier based on the relative expression levels of 62 marker genes, which can predict
7 different ALL subtypes with an accuracy as high as 97% in completely independent
samples. Because the classifier is based on gene expression rank values rather than
actual values, the classifier enables an individualized diagnosis, without the need
to reference the background distribution of the marker genes in a large number of other
samples, and also enables cross platform application. Here, we demonstrate that the
classifier can be extended from a microarray-based technology to a multiplex qPCR-based
technology using the same set of marker genes as the advanced fragment analysis (AFA).
Compared to microarray assays, the new assay system makes the convenient, low cost and
individualized subtype diagnosis of pediatric ALL a reality and is clinically applicable,
particularly in developing countries.




Figure 10. Hierarchical cluster of 240 microarray samples and 160 AFA samples. (A) Heatmap of 240 microarray samples. 
Expression levels of 57 marker genes were ranked from low to high in each sample. A high rank value represents a high expression 
value. The top color bar in the heatmap indicates the subtype each sample belongs to. (B) Heatmap of 160 AFA samples. The rank 
value was used as in (A). For the top color bar in the heatmap, the subtype bar indicates the real subtype for each sample, and the 
predict bar indicates the prediction results for each sample. 
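
Why rank values make the classifier portable across platforms can be shown with a small Python sketch. It only illustrates within-sample ranking; it is not the published classifier, and the function name, gene labels and intensity values are hypothetical.

```python
def rank_transform(expression):
    """Map {gene: value} to {gene: rank} within one sample.

    Rank 1 is the lowest expression; tied values share their average rank.
    """
    genes = sorted(expression, key=expression.get)
    ranks = {}
    i = 0
    while i < len(genes):
        j = i
        # Extend j over the run of genes tied with genes[i].
        while j + 1 < len(genes) and expression[genes[j + 1]] == expression[genes[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for gene in genes[i:j + 1]:
            ranks[gene] = average_rank
        i = j + 1
    return ranks


# Hypothetical readings for the same sample on two platforms with different scales.
microarray_like = {"geneA": 5.0, "geneB": 1.0, "geneC": 9.0}
qpcr_like = {"geneA": 500.0, "geneB": 20.0, "geneC": 10000.0}
same_ranks = rank_transform(microarray_like) == rank_transform(qpcr_like)
```

Because only the ordering of marker genes within a sample is used, the two hypothetical measurements above give identical rank profiles even though their absolute scales differ.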
Future Perspective 


Network biology still faces many challenges. Datasets are both data-rich and data-poor,
that is, false positives and limited coverage are still the rule. The transition from model
organisms to human means orders-of-magnitude increases in the complexity of both
experimentation and computation. Most edges in network maps are still missing signs
and directions. Post-transcriptional modifications cannot be easily monitored at large
scale. Tissue and cell type specificities are not considered. Genome-wide dynamic
measurements are costly. However, with the development of novel high-throughput and
single-cell dynamic measurement techniques, and with improvements in the accuracy and
coverage of existing technologies, high-throughput experiments will continue to
accelerate data acquisition and raise further need for data processing, integration,
analysis and modeling. 
methylation and its regulation, National 
Science Foundation of China, Grant No. 
91519330 


Ageing at the Wellcome Trust Genome 
Campus in Hinxton (near Cambridge), UK, 
May 18-20, 2015 
+ University College London (UCL), UK, May 
21, 2015 


+ Aging Phenome and Regulome in 
Caucasians and Asians in Washington 
State, US, National Ministry of Science and 
Technology, Grant No. 2016YFE0108700 


+ National University of Singapore Workshop 
on Networks in Biological Sciences, 
Singapore, June 8-12, 2015 


+ International Computational Systems 
Biology Meeting, August 20-26, 2015 
(Invited Session Organizer) 


+ Cold Spring Harbor Suzhou Aging and 
Diseases, Suzhou, China, September 14-18, 
2015 

+ Beijing University Quantitative Biology 
Center, Beijing, China, October 8, 2015 

+ Fudan University Medical School 
Symposium on Epigenetics, Shanghai, 
China, October 19-21, 2015 


+ Jiaotong-Yale University Joint Symposium 
on Big Data Analysis, Shanghai, China, 
November 13-14, 2015 


+ 2nd Military Medical University, Shanghai, 
China, March 16, 2016 


+ China Partner Meeting, Aberdeen, UK, April 
4-8, 2016 

+ Xiangshan Symposium on Bioinformatics, 
Beijing, April 14-15, 2016 

+ Fudan Biophysics Frontier Meeting, 
Shanghai, May 7-8, 2016 

+ 45th Annual Meeting of the American 
Aging Association, Seattle, US, June 3-5, 
2016 


+ 3rd Annual International Aging Forum at the 
Basel Life Science Week, Basel, Switzerland, 
September 21-22, 2016 


+ Zing conference on Cell Fate Diversity in 
Aging, Dubrovnik, Croatia, September 25-28, 2016 

+ Cold Spring Harbor Suzhou Meeting on 
Epigenetics, Suzhou, October 10-14, 2016 
+ 3rd International Symposium on Genetics 
of Aging and Life History, Daegu, Korea, 
October 19-21, 2016 


+ Target Validation meeting in Heidelberg, 
December 4-6, 2016 


+ 38th Lorne Genome Conference, Lorne, 
Victoria, Australia, February 12-15, 2017 


+ 2nd Interventions in Aging Conference, 
Fiesta Americana Condesa, Cancun, Mexico, 
March 2-5, 2017 

+ 2017 CSHL meeting on Systems Biology: 
Networks, Long Island, NY, USA, March 14-18, 2017 

+ Keystone Symposium on Aging 
and Mechanisms of Aging-Related 
Disease, Yokohama, Japan, May 15-20, 2017 

+ Gordon Research Conference meeting 
“Genome Architecture in Cell Fate & 
Disease”, Hong Kong, July 2-7, 2017 


Organization of Scientific 
Events 


+ Cold Spring Harbor Suzhou Meeting 
on Aging and Diseases, Suzhou, China, 
September 14-18, 2015 
2.3 Comparative Biology 


Researchers: 

Philipp Khaitovich (Director) 
Phone: +86 021 54920454 
Email: khaitovich@eva.mpg.de 

Xiling Liu 
Xi Jiang 
Graham L Banes 
Kasia Bozek 
Patricia Guijarro Larraz 
Samuel Linsen 
Dingding Han 
Song Guo 
Haiyang Hu 
Jing Sun 
Ghana 
Rong Shen 
Shuyun Huang 
Yanan Guo 
Zhengzong Qian 
Meng Guo 
Zihua Wu 
Weifeng Sun 
Xiaode Yang 

Students: 

Gangcai Xie 
Luhe 
Jieyi Xiong 
Yuning Wei 
Qianhui Yu 
Zhisong He 
Bin Zhang 
Qian Li 
Hindrike Bammann 
Lei Zheng 

Research Overview 

Our group was established in September 2006 as a Max Planck Independent Research Group.
In September 2012, the group head, Dr. Philipp Khaitovich, was promoted to a PICB director
position. Thus, formally, this group has now transformed into a department. 

The main research focus of the department largely follows the previous work of the
group, with an additional new direction of metabolome and lipidome research.
Specifically, the main research directions of the department are: 

+ Molecular mechanisms of the human brain evolution 
+ Human metabolome and lipidome features 

Following is a detailed description of each of these research directions and of the
progress made by the group in 2014-2017. 


Molecular mechanisms of the human brain evolution 

Current State of Research 

In the past three years we continued our work on analysis of mechanisms of human brain
evolution by conducting transcriptome studies with better spatial and temporal resolution
and studies involving yet unexplored RNA populations. This work was mainly conducted
in collaboration with the group of Prof. Dr. Svante Pääbo at the Max Planck Institute for
Evolutionary Anthropology, Leipzig, Germany. 

Study 1. Transcriptome composition and 
evolution of human neocortical layers 

We conducted the first study of human-specific transcriptome features in individual
layers of brain neocortex. Specifically, we characterized the transcriptome of the cortical
layers and adjacent white matter in the prefrontal cortex of humans, chimpanzees
and rhesus macaques using unsupervised sectioning followed by RNA sequencing. More
than 20% of detected genes were expressed predominantly in one layer, yielding 2,320
novel human layer markers. While the bulk of the layer markers was conserved among
species, 376 switched their expression to another layer in humans. By contrast, only
133 of such changes were detected in the chimpanzee brain, suggesting acceleration
of cortical reorganization on the human evolutionary lineage. Immunohistochemistry
experiments further showed that human-specific expression changes were not limited
to neurons, but affected a broad spectrum of cortical cell types. Thus, despite apparent
histological conservation, human neocortical organization has undergone substantial
changes affecting more than 5% of its transcriptome. 


This work was published in Nature Neuroscience. 




Figure 1. Schematic representation of the methodology for cortical layer transcriptome analysis 
based on unsupervised sectioning of cortical samples. The dissected prefrontal cortex samples 
included a full cross-section of the gray matter (GM) and underlying white matter (WM). 


Study 2. Gene expression features of 
cortical development characteristic of 
autism and unique to humans 


We conducted a follow-up investigation focusing on expression of synaptic genes
during human prefrontal cortex development. Previously, we demonstrated that the high
expression period of synaptic genes is substantially extended (more than 5-fold)
in the human brain compared to brains of other primates, including chimpanzees,
and shifted to a later developmental period even after correction for the difference in
the species' maximal lifespan. We now show that expression of synaptic genes in autism,
a common cognitive disorder affecting a set of brain functions believed to be unique
to humans, is severely disrupted during the first years of postnatal development.
Specifically, we show that the evolutionarily novel shift to a later developmental interval
observed in healthy humans is not present in autism patients. Instead, expression of
synaptic genes, especially the ones linked to disease by genetic studies, peaks early during
development and then drops rapidly, thus potentially leading to abnormal neuronal
network formation. 


This work was published in PLoS Biology. 
Figure 2. Relationship between human-specific and autism-related developmental changes. (A) A global view 
of transcriptional similarity among autism and control cases, as well as chimpanzees and macaques, visualized using 
MDS with expression correlation as distance. Each circle represents an individual. The size of the circles is proportional 
to the individuals’ age (smaller circles correspond to younger individuals). The colors represent different groups (black: 
autism cases; red: controls; blue: chimpanzees; green: macaques). The first dimension correlates with hominid-monkey 
divergence (Pearson correlation, r = 0.80, p < 0.001) and the second with age (Pearson correlation, r = 0.81, 
p < 0.001). (B) Numbers of genes showing age-related and species-specific developmental profiles (red: human controls; 
blue: chimpanzees). (C) Overlap between genes showing human-specific developmental profiles and six major clusters 
of expression changes in autism. The y-axis shows relative numbers of overlapping genes calculated as the log2 ratio 
between the observed gene numbers and the numbers expected by chance, calculated as described above. The symbols 
above the bars show the significance of the overlap based on 1,000 permutations (***: p < 0.001). (D) Expression profiles 
of genes showing human-specific developmental profiles and expression divergence between autism and control cases 
in cluster 2 measured by RNA-seq (left panel) or microarrays (right panel). The x-axis shows the age information on the 
(age)? scale, the y-axis shows the expression levels standardized to mean = 0 and standard deviation = 1 before plotting. 
The points represent mean expression levels in each individual (red: controls; black: autism cases; blue: chimpanzees; 
green: macaques), the lines show cubic spline curves fitted to the individual data; the error bars show standard deviation 
of the spline curves. The numbers of genes used for plotting are shown on top of the panels. 
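
The overlap statistic in panel (C), a log2 ratio of observed to chance-expected gene numbers with permutation-based significance, can be sketched in Python as follows. The report does not give the exact procedure, so the random resampling scheme, the function name and its parameters are assumptions for illustration only.

```python
import math
import random


def overlap_enrichment(set_a, set_b, background, n_perm=1000, seed=0):
    """Estimate enrichment of the overlap between two gene sets.

    Assumed null model: draw len(set_a) genes at random from `background`
    and count how many fall in `set_b`. Returns (log2 ratio, p-value).
    """
    rng = random.Random(seed)
    pool = sorted(background)
    observed = len(set_a & set_b)
    null_counts = []
    for _ in range(n_perm):
        draw = set(rng.sample(pool, len(set_a)))
        null_counts.append(len(draw & set_b))
    expected = sum(null_counts) / n_perm
    if observed > 0 and expected > 0:
        log2_ratio = math.log2(observed / expected)
    else:
        log2_ratio = float("nan")
    # Add-one correction keeps the empirical p-value strictly above zero.
    p_value = (1 + sum(c >= observed for c in null_counts)) / (n_perm + 1)
    return log2_ratio, p_value
```

With 1,000 permutations the smallest attainable p-value under this scheme is 1/1001, which corresponds to the "p < 0.001" level marked in the figure.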




Study 3. Evolutionary changes in snoRNA 
and snRNA abundance in human and 
mammalian cortex 


In this study we investigated the expression evolution of snRNAs and snoRNAs by
measuring their abundance in the frontal cortex of humans, chimpanzees, rhesus
monkeys, and mice. Although snRNA expression is largely conserved, 44% of the
185 measured snoRNAs and 40% of the 134 snoRNA families showed significant expression
divergence among species. The snRNA and snoRNA expression divergence included
drastic changes unique to humans: a 10-fold elevated expression of U1 snRNA and a
1,000-fold drop in expression of SNORA29. The decreased expression of SNORA29 might be
due to two mutations that affect secondary structure stability. 
Figure 3. The human-specific expression pattern of SNORA29 and the underlying genetic mechanism. 
(A) Expression of snoRNA families showing expression changes specific to human, chimpanzee and rhesus 
evolutionary lineages: HL, human lineage; CL, chimpanzee lineage; RL, rhesus monkey lineage. (B) Expression of 
SNORA29 in four species. 


Using in situ hybridization, we further localized 
SNORA29 expression to nucleolar regions of 
neuronal cells. Our study presents the first 
observation of snoRNA abundance changes 
specific to the human lineage and suggests 

a possible mechanism underlying these 
changes. 


This work is now published in Genome Biology 
and Evolution. 


Study 4. Decoupling of mRNA and 
protein expression profiles in cortical 
development 


We surveyed mRNA and protein expression changes in the prefrontal cortex of humans
and rhesus macaques over developmental and aging intervals of both species’ lifespan.
We found substantial decoupling of mRNA and protein expression levels in aging, but not
in development. Genes showing increased mRNA/protein disparity in primate brain aging
form expression patterns conserved between humans and macaques and are enriched in
specific functions involving mammalian target of rapamycin (mTOR) signaling, mitochondrial
function and neurodegeneration. Mechanistically, aging-dependent mRNA/protein
expression decoupling could be linked to a specific set of RNA binding proteins and,
to a lesser extent, to specific microRNAs. Genes targeted and predicted to be targeted
by the aging-dependent posttranscriptional regulation are associated with biological
processes known to play important roles in aging and lifespan extension. These
results indicate the potential importance of posttranscriptional regulation in modulating
aging-dependent changes in humans and other species. 


This work is now published in Genome Biology. 


Figure 4. Metabolome evolution. Left panel: general analysis of metabolite concentration variation among human 
(red), chimpanzee (yellow), macaque (gray) and mouse (blue) samples in different tissues. Central panel: proportion of 
metabolic divergence on the four evolutionary lineages in different tissues. Right panel: results of the pulling strength 
test conducted in healthy human, chimpanzee and macaque subjects. Diamonds represent males, squares represent females. 
Symbol size is proportional to the individual's age. 


Human metabolome and 
lipidome features 


Current State of Research 


In the past three years we expanded this research direction by conducting a number
of studies based on an integrative analysis of the transcriptome, metabolome and lipidome
composition of tissue samples from humans and other species. This work was mainly
conducted in collaboration with the group of Prof. Dr. Svante Pääbo at the Max Planck
Institute for Evolutionary Anthropology, Leipzig, Germany, and the group of Dr. Patrick
Giavalisco at the Max Planck Institute of Molecular Plant Physiology, Golm, Germany. 


Study 1. Human metabolome and lipidome 
features on brain 


We conducted an analysis of hydrophobic and water-soluble compounds, as well as RNA
transcripts, in three brain regions, as well as kidney and muscle, of humans, chimpanzees,
rhesus macaques and mice. Use of optimized LC-MS-based metabolome and lipidome
assessment procedures allowed us to identify and quantify more than 14,000 chemical
compounds present in brain, kidney and muscle samples. At the level of water-soluble
metabolites, the largest excess of metabolite concentration changes on the human
evolutionary lineage was found in two tissues: prefrontal cortex of the brain and skeletal
muscle. Surprisingly, the excess of human-specific changes in muscle was even greater
than in brain. To test whether this enormous human-specific change in muscle metabolism
may affect muscular performance, we conducted strength tests in humans,
chimpanzees and macaques. Indeed, we found that human muscular performance in this test
was on average two times lower than that of chimpanzees and macaque monkeys. Thus,
the study of metabolic evolution has revealed a novel human-specific phenotype: a drastic
decrease in muscular strength compared to other primate species. 


This work was published in PLoS Biology. 
At the level of hydrophobic metabolites (lipids) we observed two unexpected phenomena.
Firstly, the lipid composition of the brain differed drastically from the lipid composition
of the two non-neural tissues, kidney and muscle, used in our study. Secondly, lipids
preferentially present in brain evolved much more rapidly among species than the ones
under-represented in brain or ubiquitously present in all tissues. Furthermore, lipids
preferentially present in brain, but not the ones depleted in brain, showed a particularly
rapid pace of concentration changes on the human evolutionary lineage (Figure 2). This
suggests that lipidome remodeling played roles in the emergence of brain functionality
unique to humans. We show that lipidome composition of all three brain regions differs
drastically among species. Furthermore, lipids preferentially present in brain compared
to non-neural tissues evolve approximately five times faster among species than the lipids
preferentially present in kidney and muscle. 


This work was published in Neuron. 
Figure 5. Lipidome evolution. Left panel: lipid concentration variation among human tissues. Central panel: Number and 
percentage of lipids showing tissue-specific concentration profiles. Right panel: number and percentage of brain-enriched 
(BE) and brain-depleted (BD) lipids showing significant concentration divergence on the human (red) and the chimpanzee 
(orange) evolutionary lineages. 


Study 2. Lipid composition changes 
during prefrontal cortex development and 
maturation in humans, chimpanzees, and 
macaques 


We have continued the analysis of the lipidome composition of the developing primate brain
using human, chimpanzee and macaque prefrontal cortex as a representative brain
region. Using liquid chromatography coupled with mass spectrometry (LC-MS) we analyzed
concentrations of approximately 12,000 hydrophobic compounds in 40 humans,
40 chimpanzees and 40 rhesus monkeys with ages spanning the entire postnatal
development. We further analyzed gene expression changes in the same samples
using RNA sequencing. We find that 7,589 of the 11,772 quantified lipid peaks change
concentrations significantly along the lifespan. More than 60% of these changes occur
prior to adulthood, with less than a quarter contributed by myelination progression.
Evolutionarily, 36% of the age-dependent lipids exhibit concentration profiles distinct
to one of the three species, 488 (18%) of them unique to humans. In both humans
and chimpanzees, the greatest extent of species-specific differences occurs in early
development. Human-specific lipidome differences, however, persist over most of the
lifespan and reach their peak from 20 to 35 years of age, when compared to
chimpanzee-specific ones. 


This work was published in Molecular Biology and Evolution. 


Study 3. Analysis of human- and ape-specific 
features of transcriptome and 
metabolome development in mice 
carrying a human GLUD2 gene 


We analyzed the functions of the human- and ape-specific glutamate dehydrogenase gene
(GLUD2) during postnatal brain development. In this project, the effects of GLUD2 on
metabolite and gene expression levels during cortical development were investigated
in a knock-in mouse model, as well as in humans and macaques. The results show an
unexpected involvement of GLUD2 in brain lipid biosynthesis. Specifically, in this study the
human genomic region containing the GLUD2 gene was inserted into mice, and the resulting
changes in the transcriptome and metabolome during postnatal brain development were
analyzed. Effects were most pronounced early postnatally and affected predominantly
genes involved in neuronal development. Remarkably, the effects in the transgenic
mice partially parallel the transcriptome and metabolome differences seen between
humans and macaques. Notably, the introduction of GLUD2 did not affect glutamate
levels in mice, consistent with observations in the primates. Instead, the metabolic effects
of GLUD2 center on the tricarboxylic acid cycle, suggesting that GLUD2 affects carbon
flux during early brain development, possibly stimulating lipid biosynthesis. 


This work was published in PNAS. 


Figure 6. Metabolome analyses of transgenic mice and primates. (A) The mean normalized metabolite 
concentration divergence between transgenic and control mice, based on the 24 metabolites sharing the 
same KEGG pathway as the 13 genes with expression affected by the GLUD2 genotype (dark blue curve) 
and the remaining 86 detected metabolites (light blue curve). The colored area shows variation of the 
divergence estimates of the 86 metabolites obtained by bootstrapping 1,000 times. (B) Pathway analysis of 
the GLUD2 genotype effect on mouse metabolome. The red circles show metabolite concentration 
divergence between transgenic and control mice, based on metabolites in a pathway. The boxplots 
show metabolite concentration divergence between transgenic and control mice, calculated by sampling 
1,000 times the same number of metabolites as detected in a given pathway from the bulk of remaining 
detected metabolites. Each pathway contains at least two detected metabolites and at least one gene 
differentially expressed between transgenic and control mice. The filled circles show the pathways 
with significantly greater metabolic divergence than expected by chance. (C) The schematic representation 
of the three KEGG pathways showing significantly greater metabolite concentration divergence between 
transgenic and control mice: HIF-1 signaling pathway, pentose phosphate pathway, and carbon metabolism. 
Detected metabolites are shown in blue and genes with expression affected by the GLUD2 genotype 
are shown in red. Dashed pink rectangle delineates the pathway affected by GLUD2 overexpression in 
IDH1-mutant glioma cells. (D) The mean normalized metabolite concentration divergence between 
humans and macaques based on the 11 of the 24 metabolites linked to the GLUD2 genotype effect in 
mice (dark blue curve) and the remaining 56 detected metabolites (light blue curve). The colored area 
shows variation of the divergence estimates of the 56 metabolites obtained by bootstrapping 1,000 times. 


Study 4. Lipidome determinants of maximal 
lifespan in mammals 


We conducted a study focusing on lipidome features linked to another phenotypic trait
associated with humans: long lifespan. The maximal lifespan of mammalian species, even
closely related ones, may differ more than 10-fold; however, the nature of the mechanisms
that determine this variability is unresolved. Here, we assess the relationship between
maximal lifespan and the concentrations of more than 20,000 lipid compounds, measured
in 669 tissue samples from 6 tissues of 35 species representing three mammalian clades:
primates, rodents and bats. We identify lipids associated with species’ longevity across
the three clades, uncoupled from other parameters, such as basal metabolic rate,
body size, or body temperature. These lipids clustered in specific lipid classes and pathways,
with the polyunsaturated lipids showing elevated concentration levels in the long-living
species. Enzymes linked to these lipids display signatures of greater stabilizing
selection in long-living species, and cluster in functional groups related to signaling and
protein-modification processes. These findings point towards the existence of defined
molecular mechanisms underlying variation in maximal lifespan among mammals. 


This work was conducted in collaboration with many groups, including the group of
Prof. Dr. Gary R. Lewin at the Max-Delbrück Center for Molecular Medicine, Berlin, Germany. 


This work was published in Scientific Reports. 
Figure 7. Dataset description. (A) Phylogenetic tree of the 35 species used in this study. Colors show the clade identity and 
the maximal lifespan (MLS) of species, with darker shades representing longer MLS. (B) Species’ MLS distribution in years. 
Each dot represents a species. Colors are as in panel A. (C) Number of lipid compounds measured in each tissue. (D) Relationship 
between species’ MLS and body mass. The colors are as in panel A. Open circles indicate species deviating from the linear 
relationship between MLS and body mass. The dashed line shows linear model fit to the remaining species (F-test, p < 0.05). MLS was 
normalized to the maximal MLS value within each clade. NMR: naked mole-rat; HM: human; GS: grey squirrel; CBWB: 
common bent-winged bat; RBFB: Rickett’s big-footed bat; CHB: Chinese horseshoe bat. (E) Number of lipid compounds of 
each tissue after removal of the compounds related to the confounding factors. 
protein expression decoupling reveals RNA 
binding proteins and miRNAs as potential 
modulators of human aging. Genome Biol. 
2015 Feb 22;16:41. doi: 10.1186/s13059-015-0608-2. 

+ Bozek K, Wei Y, Yan Z, Liu X, Xiong J, 
Sugimoto M, Tomita M, Pääbo S, Sherwood 
CC, Hof PR, Ely JJ, Li Y, Steinhauser D, 
Willmitzer L, Giavalisco P, Khaitovich 
P. Organization and evolution of brain 
lipidome revealed by large-scale analysis of 
human, chimpanzee, macaque, and mouse 
tissues. Neuron. 2015 Feb 18;85(4):695-702. 
doi: 10.1016/j.neuron.2015.01.003. Epub 
2015 Feb 5. 

+ Su ZD, Sheng QH, Li QR, Chi H, Jiang X, Yan 
Z, Fu N, He SM, Khaitovich P, Wu JR, Zeng R. 
De novo identification and quantification 
of single amino-acid variants in human 
brain. J Mol Cell Biol. 2014 Oct;6(5):421-33. 

+ Hu HY, He L, Khaitovich P. Deep 
sequencing reveals a novel class of 
bidirectional promoters associated with 
neuronal genes. BMC Genomics. 2014 Jun 
10;15:457. 

+ Bozek K, Wei Y, Yan Z, Liu X, Xiong J, 
Sugimoto M, Tomita M, Pääbo S, Pieszek R, 
Sherwood CC, Hof PR, Ely JJ, Steinhauser 
D, Willmitzer L, Bangsbo J, Hansson O, Call 
J, Giavalisco P, Khaitovich P. Exceptional 
evolutionary divergence of human muscle 
and brain metabolomes parallels human 
cognitive and physical uniqueness. PLoS 
Biol. 2014 May 27;12(5):e1001871. 

+ He Z, Bammann H, Han D, Xie G, 
Khaitovich P. Conserved expression of 
lincRNA during human and macaque 
prefrontal cortex development and 
maturation. RNA. 2014 May 20 

+ Khrameeva EE, Bozek K, He L, Yan Z, 
Jiang X, Wei Y, Tang K, Gelfand MS, Prüfer 
K, Kelso J, Pääbo S, Giavalisco P, Lachmann 
M, Khaitovich P. Neanderthal ancestry 
drives evolution of lipid catabolism 
in contemporary Europeans. Nature 
Commun. 2014 Apr 1;5:3584. doi: 10.1038/ncomms4584. 

+ Zhao G, Guo S, Somel M, Khaitovich 
P. Evolution of human longevity 
uncoupled from caloric restriction 
mechanisms. PLoS One. 2014;9(1):e84117. 

Cooperation 


+ Development, regulation and evolution of the 
human metabolome and lipidome, Prof. Dr. 
Lothar Willmitzer and Dr. Patrick Giavalisco, 
Max Planck Institute of Molecular Plant 
Physiology, Germany. 

+ Molecular mechanisms of human brain 
evolution, Prof. Dr. Svante Pääbo, Max Planck 
Institute for Evolutionary Anthropology, 
Germany. 

+ Molecular mechanisms of autism and 
association between human-specific brain 
features and autistic dysfunction, Prof. Dr. 
Schahram Akbarian, Mount Sinai Hospital, 
USA. 

+ Regulation Mechanisms of Human Aging, 
Prof. Dr. Mehmet Somel, Ankara Technical 
University, Turkey. 

+ Algorithms for proteome and metabolome 
analysis, Prof. Dr. Oliver Kohlbacher, 
Tübingen University, Germany. 

+ Proteome analysis in human evolution and 
disease, Prof. Dr. Rong Zeng, Shanghai 
Institute for Biological Sciences, China. 

+ Role of specific regulators of cortical 
synaptogenesis in human brain evolution, 
Prof. Dr. Zilong Qiu, Shanghai Institute for 
Biological Sciences, China. 


External Funding 


+ Molecular basis of human age-related 
immune decline, National Science 
Foundation of China, Grant No. 31171232 

+ Human brain metabolic networks in aging 
and common neuropsychiatric disorders, 
MOST Grant No. 2012DFG3194 

+ plan for foreign experts 

+ Regulation mechanisms of human brain 
aging, CAS, Grant No. GJHZ201313 

+ Systematic analysis of evolutionary 
mechanisms underlying inter- and intra-species 
variation during postnatal brain 
development, National Science Foundation 
of China, Grant No. 91331203 

+ The evolutionary study on primate brain 
organization, CAS, Grant No. XDB13010200 

+ Development, regulation and evolution 
research of the human brain lipidome, Grant 
No. 31420103920. 


Teaching (2014-2017) 


+ Human Evolutionary Genetics, course for 1st 
year graduate students at the Shanghai 
branch of the CAS graduate school, March-June 2014 

+ Bioinformatics Algorithms, course for 1st year 
graduate students at the Shanghai branch 
of the CAS graduate school, September-December 2015 
3 Max-Planck Independent Research Groups 


3.1 Plant Systems Biology 


Researchers:
Dr. Xinguang Zhu (Group Leader)
Phone: +86-21-5492 0486
Email: zhuxinguang@picb.ac.cn
Qingfeng Song (Staff Scientist)
Mingnan Qu (Staff Scientist)
Xinyu Liu (Staff Scientist)
Guangyong Zheng (Staff Scientist)
Sisi Wang (Staff Scientist)
Mingzhu Lyv (Staff Scientist)
Essemine Jemaa (Postdoc)
Shahnaz Perveen (Postdoc)

Students:
Honglong Zhao
Shuyue Wang
Faming Chen
Tiangen Chang
Yi Xiao


Overview 

Photosynthesis is the process by which plants utilize solar energy to convert CO2 and H2O into carbohydrates and release O2. Via this process, plants are the ultimate source of all of our food and all fossil fuels, and have the potential to play a critical role in producing [illegible in source] and in mitigating [illegible in source]. Our work focuses on developing next-generation computational models and algorithms to identify new approaches to adapt photosynthesis to meet societal needs and to improve photosynthesis beyond the accomplishments of evolution.

Specifically, we are working on two projects, representing two different approaches with the shared goal of engineering higher photosynthetic energy conversion efficiency, i.e., optimizing existing C3 photosynthetic systems (the ePlant project) and engineering the C4 photosynthetic pathway into C3 crops (the C4 rice project). The rationale and concepts of both projects were detailed in the 2012 research institute annual report. Here we briefly describe them again.

The ePlant project: Improving photosynthesis is recognized as a feasible approach to dramatically increase crop yields. However, given the complexity of photosynthesis, the traditional experimental approach, i.e., testing the effects of different combinations of enzyme activities on photosynthetic efficiency, is unrealistic and insufficient. In contrast, combining deep knowledge of the photosynthetic process with systems integration and a quantitative approach based on mathematical models provides an efficient means to tackle such issues, as demonstrated by the indispensable role of systems modelling and design in the current integrated circuit industry. The overall goal of the ePlant project is to develop mechanistic models of plant primary metabolism (photosynthesis, respiration, nitrogen uptake and assimilation, nitrogen metabolism, and water movement through the plant from the soil to the root, stem and leaf until its release into the atmosphere) and its regulation, and to integrate these individual models into a functioning systems model (Figure 1). Once built, the ePlant model will be used as a research platform to study systems properties of plant metabolic and regulatory processes, to study the molecular mechanisms underlying the adaptation and evolution of plant traits (genetic, metabolic and structural properties related to productivity) under different environments, and to identify new ways to engineer higher plant productivity. 


Figure 1. The ePlant model schematic. The overall ePlant model will include not only plant primary metabolism (photosynthesis, respiration, nitrogen assimilation, and H2O uptake and transport processes), but also the major regulatory processes and the transport of material among different organs. ePlant will play a key role in guiding the engineering of both food and energy crops to increase productivity. 


C4 rice project: Except in cool temperature conditions, C4 photosynthesis has a higher photosynthetic energy conversion efficiency than C3 photosynthesis. Since the discovery of C4 photosynthesis in the 1960s, a significant amount of research has been conducted on the biochemistry, development and genetics of C4 photosynthesis. Biochemically, the major difference between C3 and C4 photosynthesis is that C4 photosynthesis has a CO2 concentrating mechanism, which increases the CO2 concentration around Rubisco, correspondingly suppressing photorespiration (i.e., oxygenation) and increasing the net CO2 fixation rate (A). The CO2 concentrating mechanism depends on the cooperation of two specialized cell types, i.e. the bundle sheath (BS) and mesophyll (M) cells. In BS cells, the concentration of Rubisco is high while PSII activity is low; in M cells, the concentration of PEP carboxylase (PEPC) and the PSII and PSI activities are all maintained at high levels and Rubisco activity is absent (Sage 2004). Besides the differences in biochemistry between these two cell types, C4 plants have also evolved an efficient metabolite transportation system between the two cell types (Leegood 1999); furthermore, the cell walls of bundle sheath cells are thickened to form a Kranz structure. Numerous lines of evidence suggest that engineering the C4 pathway into C3 crops is feasible, e.g., certain tissues in C3 tobacco plants conduct C4-type photosynthesis, and Eleocharis vivipara is able to switch between C3 and C4 photosynthesis depending on the environment (see review in Zhu et al. 2010). 


Converting C3 photosynthesis to C4 photosynthesis in a major staple crop, e.g. rice, is a feasible approach to dramatically increase crop yield and correspondingly contribute to alleviating the poverty and hunger of millions in food-insecure parts of the world. In addition, conversion to C4 would significantly improve nitrogen and water use efficiencies. To reach this goal, the Bill & Melinda Gates Foundation (BMGF) has funded a major C4 rice project, led by the International Rice Research Institute, with the ultimate goal of "supercharging" rice with the C4 photosynthetic machinery (Sage and Zhu 2011). This BMGF-funded C4 rice project assembled a multidisciplinary team composed of top researchers with expertise in molecular biology, genetics, physiology, biochemistry and mathematics (Figure 2). This team was designed to accelerate progress in understanding C4 photosynthesis and to develop new techniques and targets for genetic engineering. 

Figure 2. Members of the C4 rice consortium. These members are funded by the Bill & Melinda Gates Foundation to engineer the C4 photosynthetic mechanism into rice, which holds the potential to increase rice yields by up to 40% (Courtesy of Paul Quick). 


Current State of Research 


The ePlant Project 


Substantial progress has been made in the 
ePlant project over the past three years. The 
major achievements are as follows: 


1. Development of a highly mechanistic model of 
leaf internal light environment. 


One major layer of the ePlant project is an integrative model of leaf photosynthesis. Leaf photosynthetic physiology is determined by both biochemical properties and anatomical features. Although the significance of biochemical variation for photosynthetic efficiency has been well studied, the functional significance of leaf anatomical features has not been quantitatively studied. Here we developed a generic leaf structural-functional model, which can be used to evaluate the leaf internal light environment and its implications for potential electron transport rates in leaves with defined anatomy. This new model includes a) a three-dimensional representation of the basic components of a leaf, including epidermis, palisade and spongy tissues, as well as the physical dimensions and arrangements of cell walls, vacuoles and chloroplasts; and b) an efficient forward ray-tracing algorithm predicting internal light environments for light of wavelengths between 400 and 2500 nm. As a first application, we studied the influence of leaf anatomy and ambient light on leaf internal light conditions and photosynthetic properties. Results show that 1) different chloroplasts, even within the same cell, can experience drastically different light conditions, and photoinhibition may occur in some chloroplasts even under moderate ambient light levels; 2) chloroplasts in a leaf experience very different biochemical limitations, which may underlie the convexity of the light response curve; 3) bundle sheath extensions can enhance the nitrogen use efficiency of leaves; 4) chloroplast re-localization can enhance light use efficiency, in addition to its known role in photodamage avoidance under high light (Xiao et al., 2017, Journal of Experimental Botany). 

Figure 3. The geometry of a typical dicot leaf. (a) The size of the modeled leaf section is about 150 µm x 150 µm x 180 µm. (b) Modeled geometry of a leaf with chloroplasts in the face position. (c) Modeled geometry of a leaf with chloroplasts in the profile position. (d) Modeled geometry of a leaf where bundle-sheath extensions (BSEs) make up 20% of total leaf volume. 


2. Development of an integrated dynamic systems 
model of canopy photosynthesis and exploration 
of the effect of leaf chlorophyll concentration on 
canopy photosynthesis 


Canopy photosynthesis, which correlates with biomass production, includes the photosynthesis of all leaves in a canopy, in both upper and lower layers. Leaves at the top of a canopy are usually light saturated while leaves at the bottom are usually light limited, and both conditions can decrease the canopy photosynthetic CO2 uptake rate. Improving light distribution inside a canopy therefore holds great potential to dramatically increase the total canopy CO2 uptake rate. One of the major factors contributing to improved crop yield during the green revolution was an improved canopy light environment, achieved by breeding canopies with more erect leaves. We theoretically explored the potential to improve canopy photosynthesis by manipulating leaf chlorophyll (Chl) content. To do this, we developed an integrated canopy photosynthesis model including canopy architecture, a ray tracing algorithm, and models of photo-acclimation and C3 photosynthesis. 

Figure 4. Integrated canopy model combining canopy structure, ray tracing and metabolic model. 


Simulations with this model showed that (1) the efficiency of photosystem II (PSII) increased when leaf chlorophyll content was decreased by decreasing antenna size; (2) decreasing leaf chlorophyll content increased light intensity on leaves at both the top and bottom layers of a canopy; (3) the canopy CO2 uptake rate (Ac) increased by over 3% and the nitrogen use efficiency of the canopy (NUE) increased by over 14% when leaf Chl content was decreased to 40% by reducing the light harvesting complex (LHC); (4) Ac and NUE increased by over 30% and 30%, respectively, when the nitrogen saved by decreasing Chl to 40% through reducing LHC was re-distributed to other components of photosynthesis; (5) decreasing leaf chlorophyll content can improve Ac and NUE under different leaf area indices and under different temperature and light conditions. This study demonstrates that optimizing chlorophyll concentration can be an effective option to improve canopy photosynthesis, biomass production and crop yield potential. 
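
The link between leaf chlorophyll content and light penetration can be illustrated with a far simpler calculation than the ray-tracing model described above. The following is only a toy sketch under stated assumptions, not the model used in this study: it treats the canopy as horizontal layers obeying Beer's law, and it assumes that a lower chlorophyll content can be represented by a smaller extinction coefficient `k`. The function name `light_profile` and all parameter values are hypothetical.

```python
import math


def light_profile(i0, k, layer_lai, n_layers):
    """Light reaching the top of each canopy layer under Beer's law.

    i0        -- light above the canopy (umol m-2 s-1)
    k         -- extinction coefficient (assumed to fall with leaf Chl content)
    layer_lai -- leaf area index contained in one layer
    n_layers  -- number of layers, counted from the top (layer 0)
    """
    return [i0 * math.exp(-k * layer_lai * n) for n in range(n_layers)]


# Illustrative values only: a "normal" canopy and one with reduced chlorophyll.
normal = light_profile(i0=1500.0, k=0.6, layer_lai=1.0, n_layers=5)
pale = light_profile(i0=1500.0, k=0.4, layer_lai=1.0, n_layers=5)

for n, (a, b) in enumerate(zip(normal, pale)):
    print(f"layer {n}: normal={a:7.1f}  reduced-Chl={b:7.1f}")
```

In this simplification the top layer always receives the full incident light, so the sketch only captures the improved light supply to lower layers; effects on the top layer, on PSII efficiency and on nitrogen redistribution require the full model.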


With these theoretical results as a basis, we have created transgenic rice with decreased expression levels of chelatase, which is a key enzyme involved in chlorophyll synthesis. The initial results suggest that rice lines with decreased chlorophyll concentration have improved photosynthetic efficiency and water use efficiency. 


3. Development of a three-dimensional 
ray-tracing model of sugarcane canopy 
photosynthesis and its application in assessing 
impacts of varied row spacing. 


As a direct application of our canopy photosynthesis model, we have recently used it to demonstrate that better agronomic practices can be designed with a crop-specific canopy photosynthesis model. Sugarcane has emerged as the second largest source of biofuel, primarily as ethanol produced in Brazil. Dual-row planting using asymmetric spacing of rows can decrease damage to plants and soil structure from harvest equipment, though it can potentially cause some loss of productivity due to increased shading. We developed a computational framework which couples 3D canopy architectural information, a ray tracing algorithm, and a steady-state C4 photosynthesis model to study the optimal row spacing to maximize canopy photosynthesis and hence biomass production. We demonstrate the utility of the model by comparing evenly spaced rows at 100 cm to alternating row spacings of 45 and 155 cm. Asymmetric planting caused a 9.5% decrease in predicted net canopy carbon uptake over the growing season for a major current cultivar. The loss was greater at lower leaf area indices, when leaves were more vertical and when rows were oriented east-west, suggesting agronomic approaches to minimize loss. This study demonstrates the utility of this computational framework, which could also be used to aid breeding by identifying ideotypes for different environments and objectives, and to assess impacts of environmental change (Wang et al., 2017). 

Figure 5. Canopy photosynthesis model for the biofuel crop sugarcane. Panels: asymmetric planting; symmetric planting. 


4. An integrated canopy multi-physics model to 
simulate microclimate and study influences 
of wind speed, relative humidity and stomatal 
conductance on transpiration 


Improving the water use efficiency (WUE) of current cultivars is a major focus of crop improvement. Much effort has been devoted to studying factors influencing WUE, such as stomatal conductance and canopy architecture, and to developing canopy models and using them to explore options to further increase canopy water use efficiency. However, so far, there has been no mechanistic model of canopy water use efficiency which accurately describes the process of water transfer in a canopy with full consideration of the interaction between stomatal conductance, leaf temperature, ambient humidity, and air flow inside a canopy. Here we developed a novel 3D canopy multi-physics model for rice, which combines a 3D geometric model of the rice canopy with physical processes coupling fluid dynamics, heat transfer, convection and diffusion. The model successfully simulated the microclimate inside the canopy, such as water vapor concentration and temperature. In particular, the diffusive and convective fluxes of water vapor in a canopy can be quantified separately. The results suggest that the convective flux is more than 100 times higher than the diffusive flux of water vapor (Figure 6). Using our model, we explored the influence of different wind speeds, relative humidities, and stomatal conductances on canopy conductance and transpiration. This model therefore provides an integrated approach to study canopy transpiration and photosynthesis. It also provides a new tool to study factors influencing plant lodging resistance. 

Figure 6. Convective (A) and diffusive (B) fluxes of water vapor released from leaves, simulated in the canopy. The color bar scale for the convective flux is 100 times that for the diffusive flux. 
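
To give a feel for why convection can dominate diffusion in canopy water vapor transport, the back-of-envelope sketch below compares the two flux terms at a single point. It is not part of the multi-physics model; it assumes the textbook forms (convective flux = wind speed x concentration, diffusive flux = diffusivity x concentration gradient, i.e. Fick's first law), and the wind speed, concentration and gradient are made-up illustrative values. Only the diffusivity of water vapor in air (roughly 2.4e-5 m2 s-1 near 20 °C) is a physical constant.

```python
# Approximate diffusivity of water vapor in air near 20 degrees C (m2 s-1).
D_WATER_VAPOR = 2.4e-5


def convective_flux(wind_speed, concentration):
    """Advective transport: flux = u * c, in mol m-2 s-1."""
    return wind_speed * concentration


def diffusive_flux(gradient, diffusivity=D_WATER_VAPOR):
    """Magnitude of Fick's first law: flux = D * |dc/dx|, in mol m-2 s-1."""
    return diffusivity * abs(gradient)


# Hypothetical local conditions inside a canopy (illustration only):
# gentle air movement of 0.1 m s-1, 1.0 mol m-3 of water vapor,
# and a concentration gradient of 20 mol m-4.
conv = convective_flux(wind_speed=0.1, concentration=1.0)
diff = diffusive_flux(gradient=20.0)
ratio = conv / diff

print(f"convective flux: {conv:.3e} mol m-2 s-1")
print(f"diffusive flux:  {diff:.3e} mol m-2 s-1")
print(f"ratio:           {ratio:.0f}")
```

The ratio depends entirely on the assumed local conditions, so this sketch shows only the order-of-magnitude reasoning, not the simulated result reported above.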


5. Development of an integrative model of source 
sink flow in a whole plant 


After finishing the leaf-level and canopy-level models, we spent a substantial amount of time on, and recently completed, an integrative model of source, sink, and flow, which is the last key element for the final completion of the ePlant model. This new model can precisely predict crop growth and developmental processes, and hence crop yield and quality, under genetic manipulations. It includes a molecular-level description of the reproductive phase of crop growth and can mechanistically predict source-sink interaction and hence productivity. So far, we have parameterized this model for rice, which is a major staple crop globally. This model, named WACNE, integrates the major metabolic reactions involved in carbon and nitrogen assimilation, assimilate transport, storage, organ growth and senescence processes (Fig. 7). WACNE successfully predicted the commonly observed biochemical, physiological and growth patterns of major rice organs during the grain filling period under different environmental or genetic manipulations. With this model, we identified four sets of genes which have drastically different impacts on rice grain yields, i.e. universal yield enhancers (UYE), universal yield inhibitors (UYI), case-specific yield enhancers (CYE) and weak yield regulators (WYR), 80% of which were validated by transgenic experiments. 

WACNE can be used to guide future rice design, engineering and breeding. WACNE can also be easily adapted to develop growth and development models for other crops. 

Figure 7. Diagram showing the metabolic and physical processes in the WACNE model. Processes are numerically labeled in the closed circles: 1, root nitrogen absorption; 2, root nitrogen assimilation; 3, root growth; 4, root senescence; 5, leaf photosynthetic CO2 and nitrogen assimilation; 6, leaf triose phosphate, sucrose and starch interconversion; 7, leaf organic nitrogen and protein interconversion; 8, leaf senescence; 9, grain volume growth (endosperm cell proliferation); 10, grain starch and protein accumulation; 11, stem storage pool sucrose and starch, organic nitrogen and protein interconversion; 12, osmotic pressure driven long distance phloem transport of sucrose and organic nitrogen; 13, transporter-dependent short distance phloem transport of sucrose and organic nitrogen (phloem loading and unloading); 14, symplastic diffusion between phloem and stem storage pool. Abbreviations in figure: Suc, sucrose; TP, triose phosphates; I-N, inorganic nitrogen; O-N, free-form organic nitrogen; HATS, high-affinity nitrogen transport system; LATS, low-affinity nitrogen transport system; SSP, stem storage pool; LR, light reaction; CBC, Calvin-Benson cycle. The green region represents the chloroplast, the white space at the bottom represents the root tissue, and the grey region surrounding the roots represents soil. 


6. Plants in silico: an integrative framework to 
support plant systems biology research 


At present, most plant systems modeling research is conducted in an isolated manner. In other words, models are developed within individual labs by specialized experts. A paradigm shift is needed, and is timely, in moving plant modeling from largely isolated efforts to a connected community effort that can take full advantage of advances in computational science and in the mechanistic understanding of plant processes. Plants in silico (Psi) envisions a digital representation of layered dynamic sub-models, reaching in a modular framework from gene networks and metabolic pathways through to cellular organization, tissue and organ development, and resource capture in dynamic competitive environments. Ultimately this will allow a mechanistically rich simulation of a plant or community of plants in silico. The concept is to integrate models from different organizational layers, spanning from genome to phenome to ecosystem, in a modular framework that allows the use of sub-models of varying mechanistic detail representing the same biological process. Developments in high-performance computing (HPC), functional knowledge of plants and open-source version-controlled software make achieving the concept realistic. The latter feature is designed to enhance collaboration and move toward testing of, and consensus on, quantitative theoretical frameworks. Importantly, it provides a quantitative knowledge framework where the implications of a discovery at one level, e.g. single gene function or developmental response, can be examined at the whole plant or even crop and natural ecosystem levels. 
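
The idea of swapping sub-models of different mechanistic detail behind a common interface can be sketched in a few lines. The example below is a hypothetical illustration, not the Plants in silico software: the names `LeafModel`, `LinearLeaf`, `HyperbolicLeaf` and `Canopy`, and all parameter values, are assumptions made for this sketch.

```python
from typing import Protocol, Sequence


class LeafModel(Protocol):
    """Interface every leaf sub-model must satisfy (hypothetical)."""

    def assimilation(self, light: float) -> float: ...


class LinearLeaf:
    """Coarse sub-model: uptake proportional to light, capped at a maximum."""

    def __init__(self, phi: float = 0.05, a_max: float = 30.0) -> None:
        self.phi = phi
        self.a_max = a_max

    def assimilation(self, light: float) -> float:
        return min(self.phi * light, self.a_max)


class HyperbolicLeaf:
    """More detailed sub-model: rectangular-hyperbola light response."""

    def __init__(self, phi: float = 0.05, a_max: float = 30.0) -> None:
        self.phi = phi
        self.a_max = a_max

    def assimilation(self, light: float) -> float:
        return self.phi * light * self.a_max / (self.phi * light + self.a_max)


class Canopy:
    """Higher-level module that depends only on the LeafModel interface."""

    def __init__(self, leaf: LeafModel, layer_light: Sequence[float]) -> None:
        self.leaf = leaf
        self.layer_light = layer_light

    def assimilation(self) -> float:
        return sum(self.leaf.assimilation(i) for i in self.layer_light)


# The same canopy module runs unchanged with either leaf sub-model.
light = [1000.0, 200.0]
coarse = Canopy(LinearLeaf(), light).assimilation()
detailed = Canopy(HyperbolicLeaf(), light).assimilation()
print(coarse, detailed)
```

Because `Canopy` only calls `assimilation`, a more mechanistic leaf model can replace a simple one without any change to the higher-level code, which is the modularity the concept relies on.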
In summary, in the ePlant project, we have now developed an integrative model of the leaf internal light environment; an integrated model of canopy photosynthesis, coupling leaf metabolic processes with canopy architecture and the energy balance; and a highly mechanistic model of source-sink flow processes. Together these form a solid foundation for developing the complete dynamic systems model of crop growth and development: ePlant. We also proposed the Plants in silico concept, which holds the potential to be a major platform to support future plant systems biology research. 


C4 Rice Project 


In the past ~2 years, we have made substantial 
progress in the following topics: 


1. Defining the major biochemical and anatomical factors controlling the efficiency of rice with engineered C4 metabolism 

During the C4 engineering project, one question that remains to be answered is whether expressing a C4 metabolic cycle in a C3 leaf structure, without removing the C3 background metabolism, improves photosynthetic efficiency. To explore this question, we developed a 3D reaction-diffusion model of bundle-sheath and connected mesophyll cells in a C3 rice leaf. Our results show that integrating a C4 metabolic pathway into rice leaves with a C3 metabolic cycle and mesophyll structure may lead to improved photosynthesis under current ambient CO2 concentrations. We analyzed a number of physiological factors that influence the CO2 uptake rate, which include the chloroplast surface area exposed to intercellular air space, bundle-sheath cell wall thickness, bundle-sheath chloroplast envelope permeability, Rubisco concentration, electron transport capacity and the energy partitioning between the C3 and C4 cycles (Fig. 8). Among these, the partitioning of energy between C3 and C4 photosynthesis and the partitioning of Rubisco are decisive factors controlling photosynthetic efficiency in an engineered C3-C4 leaf (Wang et al., 2016). 

Figure 8. Schematic overview of the biochemical reactions in a rice plant expressing a C4 metabolism. OAA: oxaloacetic acid; PEP: phosphoenolpyruvate; PEPC: phosphoenolpyruvate carboxylase; NADP-MDH: NADP-malate dehydrogenase; NADP-ME: NADP-malic enzyme; PPDK: pyruvate, phosphate dikinase. All metabolites shown in the diagram can diffuse between different compartments in both mesophyll and bundle sheath cells. Metabolites, reactions and enzymes are indicated in black. Blue arrows: CO2 flux. 


2. Preconditioning of C4 photosynthesis 

C4 photosynthesis evolved independently from C3 photosynthesis in more than 60 lineages. Most C4 lineages are clustered together in the orders Poales and Caryophyllales, while many other angiosperm orders contain no C4 species, suggesting the existence of biological pre-adaptations in the ancestral C3 species that facilitated the evolution of C4 photosynthesis in these orders. To explore metabolic pre-adaptations from C3 towards C4 photosynthesis, we classified lineages into C4-poor and C4-rich groups based on their percentage of C4 species and conducted comprehensive comparisons of transcript levels between the non-C4 species from the C4-poor and C4-rich groups. Species in the C4-rich group show increased expression of genes related to oxidoreductase activity, light reaction components, C4 cycle related genes and Nudix hydrolases. In addition, metabolic features identified in the C4-poor group but not in the C4-rich group are predicted to be potential metabolic obstacles that decrease the possibility of C4 photosynthesis evolution. The identified metabolic obstacles included the up-regulation of a PEP/Pi translocator and of genes related to signaling pathways, stress response, defense response and plant hormone metabolism (ethylene and brassinosteroid). This suggests that the C3 ancestors in the C4-poor group may have adopted different mechanisms to cope with the stress conditions that favor C4 photosynthesis evolution. This study provides new insights into the evolution of C4 photosynthesis (Tao et al., 2016). 


3. Evidence for the role of transposons in the recruitment of cis-regulatory motifs during the evolution of C4 photosynthesis 

C4 photosynthesis evolved from C3 photosynthesis and has higher light, water, and nitrogen use efficiencies. Several C4 photosynthesis genes show cell-specific expression patterns, which are required for these high resource-use efficiencies. However, the mechanisms underlying the evolution of the cis-regulatory elements that control these cell-specific expression patterns remain elusive. We tested the hypothesis that the cis-regulatory motifs related to C4 photosynthesis genes were recruited from non-photosynthetic genes, and further examined potential mechanisms facilitating this recruitment. We examined 65 predicted bundle sheath cell-specific motifs, 17 experimentally validated cell-specific cis-regulatory elements, and 1,034 motifs derived from gene regulatory networks. Approximately 7, 5, and 1,000 of the motifs in these three categories, respectively, were apparently recruited during the evolution of C4 photosynthesis. In addition, we checked 1) the distance between the acceptor and the donors of potentially recruited motifs on a chromosome, and 2) whether the potentially recruited motifs reside within the overlapping region of transposable elements and the promoters of donor genes. The results showed that 7, 4, and 658 of the potentially recruited motifs might have moved via transposable elements. 

Furthermore, the potentially recruited motifs showed higher binding affinity to transcription factors than randomly generated sequences of the same length as the motifs. This study provides molecular evidence supporting the hypothesis that transposon-driven recruitment of pre-existing cis-regulatory elements from non-photosynthetic genes into photosynthetic genes played an important role during C4 evolution. The findings of the present study are consistent with the observed repeated emergence of C4 photosynthesis during evolution (Cao et al., 2016). 
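
The second check described above, whether a motif lies inside the region where a transposable element overlaps a donor gene's promoter, comes down to simple interval arithmetic. The sketch below is a hypothetical illustration rather than the pipeline used in the study; the half-open coordinate convention, the names `Interval`, `intersect` and `motif_in_te_promoter_overlap`, and the example coordinates are all assumptions.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    """Half-open genomic interval [start, end) on one chromosome."""

    start: int
    end: int


def intersect(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the overlap of two intervals, or None if they do not overlap."""
    start, end = max(a.start, b.start), min(a.end, b.end)
    return Interval(start, end) if start < end else None


def motif_in_te_promoter_overlap(
    motif: Interval, te: Interval, promoter: Interval
) -> bool:
    """True if the motif lies entirely inside the TE/promoter overlap."""
    shared = intersect(te, promoter)
    if shared is None:
        return False
    return shared.start <= motif.start and motif.end <= shared.end


# Made-up coordinates: the TE covers the last 500 bp of a 2 kb promoter.
promoter = Interval(1000, 3000)
te = Interval(2500, 4000)

print(motif_in_te_promoter_overlap(Interval(2600, 2610), te, promoter))
print(motif_in_te_promoter_overlap(Interval(1200, 1210), te, promoter))
```

A real analysis would also need to handle strand, multiple chromosomes and many elements per promoter, for which an interval tree or a dedicated genomics library would be more appropriate.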


Figure 9. Distribution of motifs in genes before and after recruitment. a) Before the recruitment, the donor (orange dot) is located in a gene neighboring the acceptor (purple dot) in a gene regulatory network (GRN). b) Before the recruitment, the donor contains the motif (orange block) and the acceptor (purple block) lacks the motif. c) After a copy-and-paste recruitment, the acceptor (purple block) contains the motif (orange block) while the donor also retains the motif. d) After a cut-and-paste recruitment, the acceptor (purple block) recruits the motif (orange block) while the donor loses the motif. 


4. Systems analysis of cis-regulatory motifs in C4 photosynthesis genes using maize and rice leaf transcriptomics data during a process of de-etiolation 

Identification of potential cis-regulatory motifs controlling the development of C4 photosynthesis is a major focus of current C4 photosynthesis research. We used time-series RNA-seq data collected from etiolated maize and rice leaf tissues sampled during a de-etiolation process to systematically characterize the expression patterns of C4-related genes and to further identify potential cis elements in five different genomic regions (i.e. promoter, 5'UTR, 3'UTR, intron, and coding sequence) of C4 orthologous genes. The results demonstrate that although most of the C4 genes show similar expression patterns, a number of them, including chloroplast dicarboxylate transporter 1, aspartate aminotransferase, and triose phosphate transporter, show shifted expression patterns compared with their C3 counterparts. A number of conserved short DNA motifs between maize C4 genes and their rice orthologous genes were identified not only in the promoter, 5'UTR, 3'UTR, and coding sequences, but also in the introns of core C4 genes (Fig. 10). We also identified cis-regulatory motifs that exist in maize C4 genes and in genes showing similar expression patterns to maize C4 genes, but that do not exist in rice C4 orthologs, suggesting a possible recruitment of pre-existing cis-elements from genes unrelated to C4 photosynthesis into C4 photosynthesis genes during C4 evolution. 

Figure 10. Diagram showing the numbers and types of conserved DNA motifs between maize and rice. Conserved DNA motifs identified with the k80 approach are marked with color blocks as indicated in the figure legend. The numbers of mapped sites are labeled above the small blocks. Results shared between the k80 and k30 approaches are marked in bold. 
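
As a rough illustration of what searching for conserved short motifs between orthologous regions involves, the sketch below finds exact k-mers shared between matching regions of two genes. This is a deliberately naive stand-in and an assumption on our part: it is not the k80 or k30 approach referred to in the figure, and the function names and the toy sequences are invented for the example.

```python
from typing import Dict, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(
    regions_a: Dict[str, str], regions_b: Dict[str, str], k: int
) -> Dict[str, Set[str]]:
    """For each region present in both genes, return the k-mers they share."""
    common_regions = regions_a.keys() & regions_b.keys()
    return {
        name: kmers(regions_a[name], k) & kmers(regions_b[name], k)
        for name in common_regions
    }


# Toy sequences, invented for illustration only.
maize_gene = {"promoter": "ACGTTGCAAT", "intron": "GGGTACCCAT"}
rice_gene = {"promoter": "TTGCAACGGA", "intron": "CATTTTGGTA"}

shared = shared_kmers(maize_gene, rice_gene, k=5)
for region in sorted(shared):
    print(region, sorted(shared[region]))
```

Exact matching ignores degenerate positions and reverse complements, so a real motif search would normally use position weight matrices or a dedicated motif discovery tool instead.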


5. Establishing an updated phylogenetic tree of the Flaveria genus to study the molecular events during the C4 evolutionary process 

The genus Flaveria has been extensively used as a model to study the evolution of C4 photosynthesis as it contains C3 and C4 species as well as a number of species that exhibit intermediate types of photosynthesis. The current phylogenetic tree of the genus Flaveria contains 21 of the 23 known Flaveria species and was constructed using a combination of morphological data and three non-coding DNA sequences (nuclear encoded ETS and ITS, and chloroplast encoded trnL-F). We developed a new strategy to construct an updated phylogenetic tree of 16 Flaveria species based on RNA-Seq data. The updated phylogeny is largely congruent with the previously published tree, with some modifications. We propose that the data collection method used in this study can serve as a generic method for phylogenetic tree reconstruction when the target species have no genomic information. We also showed that the F. pringlei currently used in a number of labs is a hybrid of the original F. pringlei (C3) and F. angustifolia (C3-C4). We propose that the strategy for obtaining phylogenetically informative sequences developed in this study can be used to construct the tree of life for a larger number of taxa. The updated Flaveria phylogenetic tree also supports the hypothesis of the stepwise and parallel evolution of C4 photosynthesis. 


6. The order in which C4 biochemical components were gradually acquired in grasses 

C4 plants possess a 4-carbon acid CO2 concentrating mechanism (C4-CCM) that increases the CO2 concentration around Rubisco, which decreases photorespiration. The metabolic genes involved in the C4-CCM evolved from their non-C4 paralogs and were recruited into C4 photosynthesis during evolution. However, the order in which these genes were recruited remained largely unknown. By surveying recent phylogenetic analyses and theoretical simulations, we propose that the acquisition of C4 metabolic genes in grasses started with the co-option and adaptation of PEPC, followed by the acquisition of NADP-ME for decarboxylation, and was then optimized with additional PEPCK. We systematically summarized the evidence underlying this recruitment order of the three C4 metabolic genes and studied its potential implications for C4-CCM evolution (Figure 11). 

Figure 11. Schematic diagram showing the evolution order of biochemical components in the C4-CCM (Tao et al., submitted to Journal of Experimental Botany). 


In summary, we have now defined a number of key structural and metabolic features that need to be considered in future C4 engineering efforts. A number of key factors related to the evolution of C4 photosynthesis, such as the preconditioning of C4 photosynthesis and the potential role of transposons in the recruitment of pre-existing cis-regulatory elements, have been identified. Finally, we have established an updated phylogeny of Flaveria, a model genus for C4 photosynthesis research. Understanding the molecular events during the emergence of C4 photosynthesis will be the focus of the next phase of C4 photosynthesis research.


Future Perspective 


ePlant project. 


After eight years of research in the ePlant project, we have developed modules covering processes that scale from the cellular and leaf levels through the canopy up to the whole plant. It is now time to move to the next step, i.e. to develop methods to effectively use the ePlant framework to support the future plant systems biology research community. During model development, we also identified a number of promising targets that can be engineered to increase photosynthesis and hence biomass production. For example, we found that decreasing the antenna size of the photosystems can potentially increase canopy photosynthesis. In the next few years, we will use a transgenic approach to test whether decreasing antenna size increases canopy photosynthesis and crop yield potential. Furthermore, we will identify molecular markers related to leaf chlorophyll concentration using both natural rice accessions and genetic populations. Finally, we will conduct extensive genome-wide association studies of a number of other important photosynthetic traits related to canopy photosynthetic efficiency. Specifically, we will:


1. Develop an updated model of plant primary metabolism

Photosynthesis does not work in isolation in plants; it interacts closely with respiration and nitrogen metabolism. Using ePhotosynthesis as a basis, we will develop a complete dynamic systems model of plant primary metabolism. This model will be used not only to study the interaction between photosynthesis, respiration and nitrogen metabolism, but also to evaluate potential new engineering options for improving leaf photosynthetic efficiency. We started this work two years ago in the C4 rice project.


2. Develop a pipeline to link the ePlant model to a particular rice cultivar and to genomic variations

We have now finished a complete dynamic model of plant growth and development. The challenge now is to develop effective methods to link this model to particular crops and, further, to particular crop cultivars. Once this is done, we can use the model to pinpoint the exact targets to engineer for enhanced photosynthesis and productivity. Furthermore, we need to develop effective methods to link the systems model directly to genomic variations, which will be the key to enabling the application of such models in modern crop breeding and engineering. The evolution and artificial selection of major metabolic and regulatory processes of crops can then be viewed through this new systems model of crop growth and development.


3. Test whether decreasing antenna size can increase canopy photosynthesis in the field

We will engineer rice to decrease leaf chlorophyll concentration using multiple approaches, including decreasing chloroplast number, decreasing the expression of enzymes involved in chlorophyll synthesis, and decreasing the expression of chelatase. Furthermore, we will use both natural and genetic populations to identify molecular markers associated with leaf chlorophyll concentration.


4. Genome-wide association studies of important parameters related to photosynthesis

We will use a minicore rice population developed by the USDA to conduct detailed genome-wide association studies (GWAS) of key photosynthetic parameters. These parameters include gas exchange parameters and fluorescence parameters under both high-light and low-light conditions. We will also measure cyclic electron transfer rates in this population. We will then use a transgenic approach to further test the function of the genes identified by GWAS. In addition, we will use similar rice populations to examine factors controlling the partitioning of photosynthate into growth and maintenance respiration, and the partitioning of photosynthate to grain during the grain-filling stage. A number of important genes have already been identified through this approach. In the next few years, we will systematically evaluate the importance of these genes in controlling photosynthetic efficiency and their potential in rice breeding.
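To make the idea of a genome-wide association scan more concrete, the following is a minimal sketch of a single-marker test, not the pipeline used in the project. It assumes genotypes coded as 0/1/2 allele counts and one quantitative trait per accession; the function names (`pearson_r`, `single_marker_scan`), the toy data and the simulated effect size are all hypothetical. A real analysis would additionally correct for population structure and kinship, for example with a mixed model.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker or constant trait carries no signal
    return sxy / math.sqrt(sxx * syy)


def single_marker_scan(genotypes, phenotype):
    """Test each SNP separately against the trait.

    genotypes: list of SNPs, each a list of 0/1/2 allele counts per accession.
    phenotype: list of trait values, one per accession.
    Returns a list of (snp_index, r, t), where t is the t statistic of the
    simple linear regression slope (n - 2 degrees of freedom).
    """
    n = len(phenotype)
    results = []
    for j, snp in enumerate(genotypes):
        r = pearson_r(snp, phenotype)
        t = r * math.sqrt((n - 2) / max(1e-12, 1.0 - r * r))
        results.append((j, r, t))
    return results


# Hypothetical toy data: 200 accessions, 20 SNPs, SNP 7 simulated as causal.
rng = random.Random(0)
n_acc, n_snp, causal = 200, 20, 7
genotypes = [[rng.choice([0, 1, 2]) for _ in range(n_acc)] for _ in range(n_snp)]
phenotype = [1.5 * genotypes[causal][i] + rng.gauss(0, 1) for i in range(n_acc)]

scan = single_marker_scan(genotypes, phenotype)
top = max(scan, key=lambda rec: abs(rec[2]))
print("strongest association at SNP", top[0])
```

In this toy setting the simulated causal marker should stand out clearly; candidate genes near such markers are what would then be tested with a transgenic approach.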


5. Develop the plant in silico framework

We will develop a generic plant in silico framework to support future plant systems biology research globally. The framework will include not only the basic modules, but also the major algorithms used for model development, parameterization, validation and application. The framework will also be linked to major national or international computational resources. The envisaged functions and tools incorporated in the plant in silico framework are detailed in Zhu et al. (2015).


C4 rice project.

We will continue improving our pipelines to use high-throughput data to identify key signals related to C4 development and study the potential significance of having different amounts of PCK activity in C4 plants.


1. Study the molecular evolution of C4 photosynthesis

We will conduct a phylogenetic network analysis of the genus Flaveria. This genus includes not only many C4 species, but also many C3 species and a number of intermediate C3-C4 species, making it an ideal genus for studying C4 evolution. We have obtained the required genomic sequences of different Flaveria species through collaboration with Dr. Julian Hibberd of Cambridge University. This analysis will provide insights into the steps of C4 evolution. Once we have an updated phylogenetic tree of Flaveria, we will conduct detailed comparative transcriptomic studies of a number of species in this genus. We will develop transformation protocols for different species of this genus, which will be used to test hypotheses regarding C4 evolution. Furthermore, we will also use the genus Salsola to study the molecular evolution of C4 photosynthesis, given that it includes species in which the cotyledon and leaf have different photosynthetic types. The role of CO2 concentration in C4 development will also be examined.


2. Sequencing the genomes of major species in the genus Flaveria

We will sequence the genomes of representative Flaveria species in the next few years. With the genome sequence information, we will mine for new regulatory elements associated with the emergence of C4 photosynthesis. A transformation system for Flaveria will also be established to enable detailed functional studies using the Flaveria system.


3. Testing a new theory of the evolution of C4 photosynthesis

Current understanding of C4 evolution mainly assumes that new traits were acquired gradually during evolution following a strict sequence. However, this is rather unlikely considering the large number of possible sequences that can lead to the current C4 photosynthesis. We will conduct a series of experiments to test the possibility of another scenario, i.e., that the major features of C4 photosynthesis arose independently and were later combined to form C4 photosynthesis. A large number of biochemical and anatomical features will be examined to test this hypothesis.

4. Test the physiological significance of mixtures of decarboxylases in the same C4 plants under different conditions

Wang et al. (2014) showed that C4 photosynthesis should be classified as either the NADP-ME subtype or the NAD-ME subtype. However, PEPCK, another decarboxylase, usually co-exists with either NADP-ME or NAD-ME in most C4 species. Although PEPCK is mixed with NADP-ME or NAD-ME to different degrees, the physiological significance of these mixtures is unknown. We will examine in detail the impact of different mixtures of these decarboxylases on metabolic fluxes and energy conversion efficiency under different light and CO2 conditions.

5. Construct the genetic regulatory network of C3 and C4 photosynthesis

More and more evidence suggests that the features related to C4 photosynthesis are controlled by more than one regulator, e.g. transcription factors and small RNAs. Elucidation of the genetic regulatory network related to C4 photosynthesis is therefore crucial for understanding the molecular mechanisms underlying the development of C4 features. To complement existing efforts to identify regulators of C4 photosynthesis through large-scale comparative transcriptomics, comparative genomics and traditional molecular biology approaches, such as yeast one-hybrid or yeast two-hybrid experiments, we will construct the genetic regulatory network related to photosynthesis using network biology approaches. Specifically, we will use a mutual-information-based algorithm to reconstruct the genetic regulatory networks of rice, Setaria and maize, which will be used to identify potential regulators related to C4 photosynthesis. The functions of these regulators will then be tested through transgenic experiments.
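The report does not specify which mutual-information-based algorithm will be used, so the following is only an illustrative sketch of the general idea: expression profiles are discretized, mutual information is computed for every gene pair, and pairs above a threshold are kept as candidate regulatory edges. The function names, the equal-width binning, the threshold and the gene names (`geneA`, `geneB`, `geneC`) are all assumptions made for this example. Published methods typically add steps such as significance testing or pruning of indirect edges.

```python
import math
from collections import Counter
from itertools import combinations


def discretize(values, n_bins=3):
    """Map expression values to equal-width bins 0 .. n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi


def mi_network(expression, threshold, n_bins=3):
    """Return {(gene1, gene2): MI} for gene pairs with MI >= threshold."""
    binned = {g: discretize(v, n_bins) for g, v in expression.items()}
    edges = {}
    for g1, g2 in combinations(sorted(binned), 2):
        mi = mutual_information(binned[g1], binned[g2])
        if mi >= threshold:
            edges[(g1, g2)] = mi
    return edges


# Hypothetical expression profiles across 12 samples.
gene_a = list(range(12))
expression = {
    "geneA": gene_a,
    "geneB": [2 * v + 1 for v in gene_a],  # fully determined by geneA
    "geneC": [5, 1, 9, 3] * 3,             # unrelated to geneA after binning
}
edges = mi_network(expression, threshold=0.5)
print(edges)
```

Only the dependent pair survives the threshold in this toy example; in practice, edges linking transcription factors to photosynthesis genes would be the candidates carried forward to transgenic testing.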
LaRue, Ying Shao, Zehong Ding, Qi Sun, Rohan V. Patel, Robert Turgeon, Xin-Guang Zhu, Nicholas J. Provart, Todd C. Mockler, Alisdair R. Fernie, Mark Stitt, Peng Liu, Thomas P. Brutnell (2014) Comparative analyses of C4 and C3 photosynthesis in developing leaves of maize and rice. Nature Biotechnology 32: 1158-1165.

+ Xianbin Yu, Guangyong Zheng, Lanlan Shan, Guofeng Meng, Martin Vingron, Qi Liu, Xin-Guang Zhu (2014) Reconstruction of gene regulatory network related to photosynthesis in Arabidopsis thaliana. Frontiers in Plant Science. doi: 10.3389/fpls.2014.00273.

+ Yi-Bo Chen, Tian-Cong Lu, Hong-Xia Wang, Jie Shen, Tian-Tian Bu, Qing Chao, Zhi-Fang Gao, Xin-Guang Zhu, Yue-Feng Wang, Bai-Chen Wang (2014) Posttranslational modification of maize chloroplast pyruvate orthophosphate dikinase reveals the precise regulatory mechanism of its enzyme activity. Plant Physiology 165: 534-549.

+ Yuanyuan Li, Jiajia Xu, Noor Ul Haq, Hui Zhang, Xin-Guang Zhu (2014) Was low CO2 a driving force of C4 evolution? Arabidopsis responses to long-term low CO2 stress. Journal of Experimental Botany 65: 3657-67.


Cooperation 


ePlant project

+ Global Yield Potential Consortium
+ Prof. Mark Stitt, Max Planck Institute for Molecular Plant Physiology, Germany
+ Prof. Steve P. Long, University of Illinois at Urbana Champaign, United States
+ Prof. Donald R. Ort, University of Illinois at Urbana Champaign, United States
+ Dr. Dingyang Yuan, State Key Laboratory of Hybrid Rice, China
+ Dr. Chengcai Chu, Institute of Genetics and Developmental Biology, CAS, China

C4 rice project

+ C4 Rice Consortium
+ Dr. Eric Schranz, Wageningen University
+ Dr. Luonan Chen, SIBS


External Funding since 2014 
+ C4 Rice Phase III. Oct 2015 - Dec 2019. OPP1129902. 6,999,794 dollars. Bill and Melinda Gates Foundation.

+ Model guided QTL analysis of the traits related to canopy photosynthesis. 900,000 RMB, C020401. National Science Foundation. 2014.1.1-2017.12.31

+ Ministry of Science and Technology. National High Technology Development Plan (863 plan). 2014AA101601. 2014.1.1-2018.12.31

+ The mechanism of photosynthesis and improving the light use efficiency of crops. The National Basic Research and Development Plan of China, Ministry of Science and Technology of China. 2015CB150104. 2015.11-2020.11, Co-PI.

+ A complex system biology approach to understand the factors affecting canopy photosynthesis. GJHZ1501. CAS-CSIRO Cooperative Research Program. January 2015 - December 2017. PI. 1,000,000 RMB.

+ Modeling and simulation of rice high yield and stable yields. XDA08020301. CAS Strategic Research Plan “Designer Breeding by Molecular Modules”, August 2013 - December 2017, PI. 14,626,300 RMB.

+ RIPE: Realizing Increased Photosynthetic Efficiency. Bill & Melinda Gates Foundation. OPP1060461. 25,000,000 dollars. Co-PI. December 10th, 2012 - December 10th, 2017.

+ C4-Rice Phase 2: Supercharging Photosynthesis. Bill and Melinda Gates Foundation. Reinvestment 51586. 13,000,000 dollars. Co-PI. May 2012 - May 2015.


Teaching 


+ Bioinformatics and computational biology: motif finding and HMM, CAS-SIBS
+ Photosynthesis, PICB
+ Plant Physiology, on C3/C4 photosynthesis, in SIBS
Conference Presentations and Invited Talks

+ Plant in silico. Dec 17, 2016. The Sixth Biological Interdisciplinary Research Forum, Beijing University, Beijing, China.

+ Theory and practice of high-yield high-efficiency crop breeding under global climate change. Dec 5th, 2016, SanNong Agriculture Seminar Series, Huazhong Agriculture University, Wuhan, China.

+ Plant in silico: Multiscale systems modeling of photosynthesis for improved productivity. Nov 14th, 2016, Pacific Northwest National Laboratory, Pasco, WA.

+ The rice idealtype for breeding. Symposium for cross-strait young plant biologists. Sep 23-Sep 26, Hangzhou city, Zhejiang Province, China.

+ The role of phenomics in rice breeding in the post-genomics era. The First Asia-Pacific Symposium on Plant Phenomics. Beijing, China, Oct 19-21st, 2016.

+ Systems biology of photosynthesis. Lecture series of Plant Molecular Genetics. The Institute of Genetics and Developmental Biology, CAS, Beijing, Sep 21st, 2016.

+ Introduction to modeling plant productivity. Environmental and Ecophysiological field techniques. Lisbon, Portugal, 12-16th September 2016.

+ Modeling canopy photosynthesis and productivity. Environmental and Ecophysiological field techniques. Lisbon, Portugal, 12-16th September 2016.


+ The roles of phenomics in modern rice breeding in the post-genomics era. Symposium on developing phenomics platform to support rice genetics breeding. Sep 6th, 2016, Changsha, China.

+ Photosynthate partitioning: a critical factor controlling crop biomass and yield potential. The 2016 Annual Meeting of the Society for Experimental Biology. July 7-10th, 2016, Brighton, UK.

+ Towards a rice systems model - the modules, algorithms, and applications. Realizing Plant in silico. University of Illinois at Urbana Champaign. May 18-20th, 2016, Urbana, IL, USA.

+ Redox and C4 photosynthesis. C4-50 Symposium. April 10-14th, 2016. Canberra, Australia.

+ Systems biology research of photosynthesis. The 18th Haixia Biological Research forum, Fujian Agriculture and Forest University, Dec 10th, 2015, Fuzhou.

+ Evolution versus artificial design for enhanced photosynthetic efficiency. The 7th Asia and Oceania Conference on Photobiology. Nov 15-18th, 2015, Taiwan.

+ Redesigning crops for increased yield and resource use efficiency. Global Plant Council Stress Resilience Symposium. Oct 23-25th, 2015, Foz do Iguaçu, Brazil.

+ Molecular mechanisms of C4 evolution. 2015 National Congress of Plant Biology, Oct 9-12th, 2015, Changchun, Jilin, China.

+ Development of a generic crop growth and development model to support biomass crop breeding. Perennial biomass crops for a resource constrained world: Biomass 2015, Sep 7th-10th, 2015, Stuttgart-Hohenheim, Germany.


+ Modeling photosynthesis at cell and tissue levels. Steve Long FRS Symposium. University of Essex, June 15th, 2015, Colchester, UK.

+ The role of low CO2 during evolution of C4 photosynthesis. The second international conference on photorespiration: Photorespiration - Key to Better Crops. June 1st-4th, 2015, Warnemünde, Germany.

+ Using systems and synthetic biology approaches to improve photosynthesis and crop yield potential. Chinese Academy of Agricultural Sciences. March 8th, 2015, Beijing, China.

+ Modelling of sink-source partitioning. Cereals, biomass and biofuels conference. March 6-7th, 2015, Beijing, China.

+ Systems and synthetic biology of photosynthesis. University of Illinois at Urbana Champaign, Feb 17th, 2015, Urbana, IL.

+ Systems model of rice. Shandong Normal University. Jan 23rd, 2015, Jinan, Shandong, China.

+ The role of low CO2 during the evolution of C4 photosynthesis. Plant and Animal Genome Meeting 2015. Jan 10-14th, 2015, San Diego, California.

+ ePlant: a modeling framework to support engineering crops for improved water and radiation use efficiencies. 2014 Yangling International Agri-science Forum. November 4-7th, 2014. Northwest Agricultural and Forest University, Yangling, Shaanxi, China.

+ Modeling photosynthesis at multiple scales. Plant Systems and Synthetic Biology: solutions to global food security. North Carolina State University, October 17-18th, 2014, Raleigh, North Carolina.


+ Are there three categories of C4 photosynthesis in nature? The 11th China national congress on plant biology. August 5-8th, 2014, Guiyang, China.

+ Rational design of photosynthetic metabolism for increased efficiency. The 3rd international congress on plant metabolism. July 2-6th, 2014, Xiamen, China.

+ Rational design of photosynthetic metabolism for increased efficiency. Biology of Chloroplast: towards a blueprint for synthetic organelles, June 21-26th, 2014, Pułtusk, Poland.

+ Reclassification of C4 photosynthesis in Nature. May 16th, 2014, South China Agricultural University, Guangzhou.

+ Reclassification of C4 photosynthesis in Nature. 2014 China National Photosynthesis Congress. April 15-18th, 2014, TaiAn, China.

+ Special features of C4 photosynthesis in Setaria viridis. The First International Setaria Genetics Conference, March 10-12th, 2014, Beijing, China.
3.2 Population Genomics

Researchers:

Dr. Shuhua Xu
Group Leader

Staff:

Jing Pu (Research Assistant, since 2011)
Jing Li (Associate Professor, 2012-2015)
Yuan Yuan (Associate Professor, since 2012)
Yan Lu (Research Associate, since 2012)
Wenfei Jin (Research Associate, 2012-2013)
Pankaj Kumar (Postdoctoral Fellow, 2012-2014)
Asif Ullah Khan (Postdoctoral Fellow, since 2013)
Jiawei Shen (Postdoctoral Fellow, since 2016)

Students:

Dongsheng Lu (2009-2016)
Meng Shi (2009-2016)
Ruiqing Fu (since 2010)
Lian Deng (since 2011)
Yuchen Wang (since 2011)
Kai Yuan (since 2012)
Chao Zhang (since 2012)
Qidi Feng (since 2012)
Xiaoji Wang (since 2013)
Xi Zhang (ShanghaiTech Uni.) (since 2013)
Xixian Ma (since 2014)
Zhendong Wu (ShanghaiTech Uni.) (since 2014)
Jiaojiao Liu (ShanghaiTech Uni.) (since 2014)
Chang Liu (since 2015)
Yang Gao (ShanghaiTech Uni.) (since 2015)
Zhilin Ning (since 2016)
Wanxing Xu (ShanghaiTech Uni.) (since 2016)


Research Overview

Officially set up in January 2012 at the Partner Institute for Computational Biology, the Max-Planck Independent Research Group on Population Genomics (PGG) focuses on population genomics research into human admixture history and biological adaptation to local environments. Population genomics is a discipline that infers population genetic and evolutionary parameters from genome-wide data sets. The ultimate goal of this research group is to understand the mechanisms of microevolution in humans, with genetic admixture taken as the entry point for pursuing this ambition.

Admixture has been a common phenomenon throughout the history of modern humans, as previously isolated populations often come into contact through colonization and migration. It is important to conduct a full analysis of genetic structure and characterize the genetic make-up of admixed populations. On the one hand, this will shed light on human genetic history; on the other hand, increased population admixture influences genome diversity, which in turn affects phenotypes relevant to health. Genetic admixture thus has many implications for medical research. Our group uses computational approaches and develops new methods to dissect the genetic architecture of human populations, quantitatively characterize their admixture features, and reveal their migration history and adaptive divergence. Specifically, our group is working on several projects on theoretical modeling of human population admixture, statistical inference of human migration history, and detection of footprints of natural selection in the human genome. These projects are expected to advance our understanding of genetic admixture as an evolutionary driving force that facilitates human microevolution.


Current State of Research 


Population genomics plays a vital role in dissecting the genetic architecture of complex traits and diseases by separating locus-specific effects from genome-wide effects; it is thus a bridge from evolution to medicine. Over the past decades, many international collaborations have made remarkable achievements in the study of human genetic variation, such as the Human Genome Project (HGP), the HapMap Project, the Pan-Asian SNP Project, and the 1000 Genomes Project. Nonetheless, considering the very heterogeneous ethnic groups and large population size of our country, regional efforts are necessary to provide a more precise and comprehensive characterization of genomic diversity. In addition, the high mobility of people in recent history and modern society has considerably increased the chance of inter-ethnic marriage, or genetic admixture, which in turn influences genome diversity and further affects phenotypes relevant to health. The good news is that both the US and Chinese governments have launched Precision Medicine Projects, which are expected to produce vast omics data at both the individual and population levels at increasingly fast rates in the next 5-10 years. However, a lack of adequate knowledge of the genetic structure of populations increases the risk of flawed study design in ongoing sampling efforts, whether or not they are supported by the Precision Medicine Projects.


Over the last two years (2015 and 2016), my group continued to study genetic admixture and focused on developing methods for modeling human population admixture and inferring population history. Building on our previous research on recent population admixture, such as that in Xinjiang's Uyghurs, we were able to extend our methods to the analysis of admixed populations with more ancient and complicated histories. An outstanding example is the Tibetan highlanders living at elevations over 4,500 meters on the Qinghai-Tibet Plateau. Here I give a very brief summary of some representative work.


Reconstruct Genetic Origins and 
Population History of Tibetan 
Highlanders 


Current knowledge of the origin and population history of Tibetan highlanders is still in its infancy and controversial. Who are the Tibetans? How long have they been living on the Tibetan Plateau, the “Roof of the World”? Who were the early highlanders? Were they modern humans or non-modern human species? Is there genetic continuity, or just some continuity of culture, between the prehistoric populations and present-day Tibetans? These questions remain among the most contentious puzzles in history, anthropology, and genetics.
We analyzed deep-sequenced genomes of 38 Tibetan highlanders together with available data on archaic and modern humans, comprehensively characterized the ancestral makeup of Tibetans and uncovered their origins. Our analysis showed that Tibetans arose from a mixture of multiple ancestral gene pools. In particular, analysis of ~200 contemporary populations showed that Tibetans share ancestry with populations from East Asia (~82%), Central Asia and Siberia (~11%), South Asia (~6%), and western Eurasia and Oceania (~1%). The Tibetans and Sherpas show the closest affinities to surrounding highland groups such as the Yizu, Tu and Naxi, followed by lowland Han Chinese. The divergence time between the Tibetan and Han Chinese populations was estimated to be ~15,000 to ~9,000 years ago.


We applied state-of-the-art methods and also developed a new method (ArchaicSeeker) to search for ancient ancestries in the genomes of Tibetan highlanders. We identified elevated archaic ancestry in Tibetans, and dated the most recent common ancestors of the surviving archaic lineages in the Tibetan genomes to ~60,000-40,000 years ago, predating the Last Glacial Maximum (LGM) and much earlier than many previous studies assumed. Our results indicate that colonization of the plateau and human adaptation to high altitude occurred considerably earlier, and were more complicated, than previously suspected.


Figure 1. Summary plot of genetic admixture. Individual admixture proportions were estimated from 592,799 autosomal SNPs with genotype data available for 38 TIB, 39 HAN and 2,345 HuOrigin samples (African samples were not included). Each individual is represented by a single line broken into K = 7 colored segments, with lengths proportional to the K = 7 inferred clusters. Population IDs are shown outside the circle of the plot. Population-level admixture results for TIB and HAN are summarized in the two pie charts at the center of the circle plot, with admixture proportions given as percentages and shown in different colors.
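As an illustration of how individual admixture proportions can be summarized at the population level, the sketch below averages per-individual ancestry vectors within each population. This is an assumption about the summary step, not a description of the software used in the study. The function name `population_admixture`, the sample IDs and the K = 3 proportions are hypothetical values chosen for the example and are not results from the paper.

```python
from collections import defaultdict


def population_admixture(individual_q, population_of):
    """Average individual ancestry proportions within each population.

    individual_q: {sample_id: [q_1, ..., q_K]}, each vector summing to 1.
    population_of: {sample_id: population_label}.
    Returns {population_label: [mean q_1, ..., mean q_K]}.
    """
    sums = {}
    counts = defaultdict(int)
    for sample, q in individual_q.items():
        pop = population_of[sample]
        if pop not in sums:
            sums[pop] = [0.0] * len(q)
        sums[pop] = [s + v for s, v in zip(sums[pop], q)]
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in total] for pop, total in sums.items()}


# Hypothetical individual-level proportions with K = 3 clusters.
individual_q = {
    "TIB_1": [0.80, 0.15, 0.05],
    "TIB_2": [0.90, 0.05, 0.05],
    "HAN_1": [0.95, 0.03, 0.02],
}
population_of = {"TIB_1": "TIB", "TIB_2": "TIB", "HAN_1": "HAN"}

pop_q = population_admixture(individual_q, population_of)
for pop, q in pop_q.items():
    print(pop, [round(100 * v, 1) for v in q])  # percentages, as in a pie chart
```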
This study provides compelling evidence for the co-existence of both Paleolithic and Neolithic ancestries on a genome-wide scale in the modern Tibetan gene pool, which supports a genetic continuity between pre-LGM highland foragers and present-day Tibetans. The Paleolithic ancestries in the modern Tibetan gene pool entangle Denisovan-like, Neanderthal-like, ancient-Siberian-like, and unknown archaic sequences, indicating that Tibet remained a human melting pot where interbreeding occurred among different hominin groups before the LGM. We suggested that the highly differentiated sequences harbored in highlanders’ genomes were most likely inherited from pre-LGM settlers of multiple ancestral origins, a genetically admixed group which was named SUNDer, and maintained at high frequency by natural selection. We further proposed a two-wave “Admixture of Admixture” (AoA) model to help explain the ancestral make-up and prehistory of Tibetans and Sherpas.
Figure 2. A sketch map of the origins and demographic history of Tibetans and Sherpas. A simplified model of the origins and evolutionary history of Tibetans and Sherpas based on the observations and estimates from this study. The ancient gene pool of the Tibetans originated from an ancient admixed population, SUNDer, which was a group of hybrids of ancient Siberians (modern human) and several archaic populations, including Denisovan-like, Neanderthal-like, and likely a few unknown non-modern human groups that have not yet been identified by archeological or genetic studies. The admixture events that eventually formed SUNDer could have occurred on the Tibet Plateau, or in lowland areas before SUNDer arrived at the Tibet Plateau at least ~40,000 YBP (dated from TMRCA estimates of archaic DNA sequences in the Tibetan and Sherpa genomes), before the Last Glacial Maximum (LGM) (~26,500-19,000 YBP). Between ~40,000 and ~15,000 YBP, few new migrations occurred from the lowland to the plateau because of the LGM. However, from about ~15,000 to ~9,000 YBP (based on the divergence time estimated between HAN and TBN), there was a second wave of migration from the lowland to the plateau that carried modern human ancestry, likely from a population that split from the common ancestor of Tibetans and Han Chinese. The divergence of Tibetans and Sherpas occurred ~11,000 to ~7,000 YBP (based on MSMC analysis). MRCA0: most recent common ancestor of modern humans and archaic hominoids; MRCA1: most recent common ancestor of Eurasians; MRCA2: most recent common ancestor of HAN and TIB; MRCA3: most recent common ancestor of TBN and SHP. The two dashed lines connecting HAN and TBN, and TBN and SHP, represent possible gene flow between populations.
Figure 3. A snapshot of the Wikipedia web page on Tibetan people.

The study, entitled “Ancestral Origins and Genetic History of Tibetan Highlanders”, was published online in full in the American Journal of Human Genetics 99, 580-594 on September 1, 2016. Many of the results and conclusions from our study were immediately cited by Wikipedia (https://en.wikipedia.org/wiki/Tibetan_people).

Our findings were also reported in a news article entitled “Ice Age Tibetans” in Scientific American (March 2017), 316, 14-16. See below for a scanned copy of the article. http://www.nature.com/scientificamerican/journal/v316/n3/full/scientificamerican0317-14.html
Method 1: AdmixInfer

Length Distribution of Ancestral Tracks under a General Admixture Model and Its Applications in Population History Inference

We proposed a general admixture model that covered all possible admixture scenarios and deduced the length distribution of ancestral tracks, thus providing a theoretical framework for reconstructing population admixture history. In this study, we first described the general admixture model and deduced a general formula for the theoretical distribution of ancestral tracks under some reasonable approximations. With this distribution, we can use maximum likelihood estimation (MLE) to estimate model parameters, and the Akaike information criterion (AIC) or the likelihood ratio test (LRT) to select an optimal model from the candidates for the given data. We next demonstrated that three admixture models from previous studies, namely the hybrid isolation (HI), gradual admixture (GA) and continuous gene flow (CGF) models, are all special cases of our general model. Then, under these three models, we developed a method called AdmixInfer to estimate the admixture proportion and admixture time, and simultaneously select the optimal model according to the AIC. Simulations demonstrated that our methods could infer the model and parameters with high accuracy. Moreover, our methods were insensitive to demographic history, sample size and the threshold used to discard short ancestral tracks. Finally, good performance was also observed when AdmixInfer was applied to real datasets of African Americans, Mexicans and South Asian populations from the HapMap project and the Human Genome Diversity Project.

Figure 4. The general admixture model. Here we illustrate an admixed population with multiple ancestral populations, which started to admix T generations ago. The gene flow from each ancestral population could be zero at a specific generation. POP represents the ancestral population. Generations in the diagram are labeled Generation 0, Generation 1, ..., Generation T-1.

Method 2: iMAAPs

Inference of Multiple-wave Population Admixture by Modeling Decay of Linkage Disequilibrium with Polynomial Functions

To infer the histories of population admixture, one important challenge with methods based on admixture linkage disequilibrium (ALD) is to remove the effect of source LD (SLD), which is directly inherited from source populations. In previous methods, only the decay curve of weighted LD between pairs of sites whose genetic distance was larger than a certain starting distance was fitted by single or multiple exponential functions, for the inference of recent single- or multiple-wave admixture. However, the effect of SLD has not been well defined and no tool has been developed to estimate the effect of SLD on weighted LD decay. In this study, we
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defined the SLD in the formularized weighted the performance of iMAAPs under various 

LD statistic under the two-way admixture admixture models in simulated data and 
model, and proposed a polynomial spectrum applied iMAAPs to the analysis of genome- 
(p-spectrum) to study the weighted SLD and wide single nucleotide polymorphism data 
weighted LD. We also found that reference from the Human Genome Diversity Project 
populations could be used to reduce the (HGDP) and the HapMap Project. We showed 
SLD in weighted LD statistics. We further that IMAAPs is a considerable improvement 
developed a method, iMAAPs, to infer over other current methods, and further 
multiple-wave admixture by fitting ALD facilitates the inference of histories of complex 
using a polynomial spectrum. We evaluated population admixtures. 
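As a rough illustration of the exponential fitting that the earlier ALD-based methods rely on, the sketch below fits a single exponential to a weighted LD decay curve. It assumes the commonly used approximation z(d) = A * exp(-T * d), with d the genetic distance in Morgans and T the number of generations since a single admixture pulse. It is not the polynomial-spectrum procedure of iMAAPs, and the function name and the synthetic values are hypothetical.

```python
import math


def fit_single_exponential(distances_morgan, weighted_ld):
    """Fit z(d) = A * exp(-T * d) by least squares on log(z).

    Assumes every weighted LD value is positive. Returns (A, T).
    """
    xs = list(distances_morgan)
    ys = [math.log(z) for z in weighted_ld]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log z = log A - T * d, so the slope is -T
    return math.exp(intercept), -slope


# Synthetic, noise-free curve (hypothetical values, not data from the study)
true_amplitude, true_generations = 0.02, 10.0
distances = [0.005 * k for k in range(1, 41)]
curve = [true_amplitude * math.exp(-true_generations * d) for d in distances]

amplitude_hat, generations_hat = fit_single_exponential(distances, curve)
print(round(amplitude_hat, 4), round(generations_hat, 2))
```

With real data the short-distance bins would still carry source LD, which is the problem the p-spectrum and the use of reference populations are meant to address.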
Figure 5. P-spectrum in a simulated admixed population. The observed weighted LD decay (gray points
in top right) is fitted by hundreds of polynomial functions (gray curves in the bottom panel, each curve
connecting to the position represents the decay of the function, with the value ranging from 0 to 0.7 Morgan), a few
of which have positive coefficients (highlighted in heat color). The amplitudes for each positive coefficient are
plotted along the value of I (generations ago) in the top left.


Method 3: CAMer

Modeling Continuous Admixture Using Admixture-Induced Linkage Disequilibrium

To understand the complex population admixture process, here we used admixture-induced
linkage disequilibrium (LD) to infer continuous admixture events, which are common in
most existing admixed populations. Unlike previous studies, we expanded the typical
continuous admixture model to a more general admixture scenario with isolation after a
certain duration of continuous gene flow. Based on the extended models, we developed a
method based on weighted LD to infer the admixture history, considering the continuous
and complex demographic process of gene flow between populations. We evaluated the
performance of the method by computer simulation and further applied our method to
real data analysis of a few well-known admixed populations.
Figure 6. Evaluation of CAMer under various simulated
admixture models. Here, the core models are HI, GA-I, CGF1-I,
and CGF2-I. The simulated models (True Model) are listed
on the left, with the admixture time interval depicted in the
parentheses. The gray area on the middle vertical panel is the
simulated time interval, whereas colored lines indicate the
estimated time intervals under different core models. HI: pink;
CGF1-I: green; CGF2-I: purple; GA-I: blue. The intensity of lines
indicates the number of points covered by the time intervals
estimated from all jackknives.


Identifying a Tibetan-Specific Copy Number
Deletion that May Account for High-altitude
Adaptation


High altitude adaptation (HAA) of Tibetan
highlanders has been studied extensively and
many candidate genes have been reported
based on analysis of single nucleotide
polymorphism (SNP) data. Among the many
reported HAA candidates, a hypoxia pathway
gene, EPAS1, is the top gene, identified by
most previous studies as having the most
extreme signature of positive selection in
Tibetans. Subsequent efforts to identify
functional variants, however, have so far had
limited success.


We developed a new method (WinXPCNVer)
specifically for detecting population-specific
copy number variations, and identified a 3.4-kb
copy number deletion near EPAS1 that is
significantly enriched in high-altitude Tibetans,
namely the Tibetan-enriched deletion (TED).
About 90% of Tibetans carry the TED and 50%
have lost both copies, whereas only 3% of 2,792
worldwide samples are TED carriers and no
homozygous deletion carriers were found
among non-Tibetan samples. By analyzing
databases and the literature, we further
explored the functional potential of the TED
and found that an enhancer histone mark
overlaps the TED and that the TED is
associated with hemoglobin concentration in
Tibetans. We also performed deep whole-genome
sequencing of seven Tibetans and verified the
TED, but failed to identify any other copy-number
variations with comparable patterns, giving the
TED top priority for further study.
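The two Tibetan figures quoted above (about 90% carriers, 50% homozygous) can be checked against each other with a short calculation. The sketch below assumes Hardy-Weinberg proportions, which the text does not state, so it is only a consistency check under that assumption; the function names are hypothetical.

```python
import math


def carrier_fraction(deletion_freq):
    """Fraction carrying at least one deleted copy under Hardy-Weinberg."""
    return 1.0 - (1.0 - deletion_freq) ** 2


def homozygous_fraction(deletion_freq):
    """Fraction with both copies deleted under Hardy-Weinberg."""
    return deletion_freq ** 2


# Deletion allele frequency implied by 50% homozygous deletions
deletion_freq = math.sqrt(0.50)
expected_carriers = carrier_fraction(deletion_freq)
print(round(deletion_freq, 3), round(expected_carriers, 3))
```

Under this assumption a 50% homozygous fraction implies an allele frequency of about 0.71 and about 91% carriers, which agrees with the reported "about 90%".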
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Figure 7. Tibetan-enriched deletion downstream of EPAS1. (A) Population structure of Tibetan samples from different 
sources (Qinghai: TIB1; Tibet: TIB2, TIB3 and TIB-seq). The principal component analysis (PCA) plot was generated by 99,768 
genome-wide random SNPs. Each dot represents one Tibetan individual. The x-axis and y-axis represent the first and the 
second principal component (PC), which explains 12.02% and 10.23% of the total variance, respectively. (B) Genome-wide 
distribution of VST-w calculated as the mean VST of the top three probes in each 3 kb-sliding window. The red vertical line 
represents the Tibetan-enriched deletion downstream of EPAS1. (C) Read depth (RD) of seven Tibetan, two Sherpa, one 
Neanderthal, one Denisovan and five modern human individuals. The deletion region is highlighted in the blue bar at the 
bottom. Samples with homozygous and heterozygous deletion showed a zero and half-level of the normal (flanking) RD, 
respectively. Four Tibetan and two Sherpa individuals carried homozygous deletion and the other three Tibetan individuals 
carried heterozygous deletion. No deletions were found in other individuals. (D) Diagram of locations of microarray probes, 
long PCR primers and EPAS1 position. 
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Figure 8. Distribution of the TED frequency among populations and correlation with altitude. (A) Deletion 
frequency distribution in Asian populations. Colors from yellow to red indicate the frequency from low to high. Each blue 
triangle represents a sampled population. (B) Deletion frequency correlated with altitude in Asian populations (population 
information is listed in Table 2); R2=0.958. (C) Deletion frequency correlated with altitude in five Tibetan sub-groups (Lhasa, 
Nyingchi, Qamdo, Shannan and Shigatse); R2=0.989. 
Although the function of the TED has not yet
been fully characterized, many lines of
evidence support the TED as a promising
candidate that might have played a critical
role in the high altitude adaptation of Tibetans.
Additional experimental studies are still
needed to verify the functional role of the TED
in the adaptation of Tibetan people to the
highlands. The findings on the TED opened
a new window on the functional mechanism
of EPAS1, which is expected to eventually
reveal the molecular basis of HAA in Tibetans.


The study entitled “A 3.4-kb Copy-Number
Deletion near EPAS1 Is Significantly Enriched
in High-Altitude Tibetans but Absent from
the Denisovan Sequence” was published in
American Journal of Human Genetics 97, 54-66
on June 11, 2015.
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Future Perspective 


In the next two years or longer, our
particular aim is to explore the effects of
population admixture on genetic diversity
and phenotypic diversity. This ambitious
project was first proposed two years ago;
it is now an ongoing project showing
promising progress.


Through extensive collaborations, my group
has recently expanded its research to most of
the hot areas of population admixture in Eurasia,
including the Middle East (Yang et al., J. Hum.
Genet., 2014), Central Asia (Lou et al., Eur. J. Hum.
Genet., 2014; Li et al., J. Med. Genet., 2014; Lou et
al., Am. J. Hum. Genet., 2015; Lu et al., Am. J. Hum.
Genet., 2016), South Asia and Southeast Asia
(Xu et al., PNAS, 2012; Deng et al., Hum. Genet.,
2014; Hoh et al., Hum. Genom., 2015).
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Figure 9. A framework for population admixture research in PGG. 
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Hot areas of population admixture in Eurasia 


Figure 10. Areas highlighted in red are those where human populations are generally admixed and where quite a few
studies have been conducted over the last couple of years or where collaborative projects led by PGG are ongoing.


Key questions. To pursue our ambition
and reach our goal, we attempt to answer
the following key questions: (i) what is the
influence of admixture on population genetic
diversity and individual genetic make-up?
(ii) what is the influence of recombined
(ancestral) alleles (via admixture) on
transcriptomic profiles? (iii) what are the effects
of genetic admixture on skin color (in
collaboration with Dr. Yajun Yang and Prof.
Li Jin), skin characteristics (in collaboration
with Dr. Sijia Wang) and facial morphology (in
collaboration with Dr. Kun Tang)?


Study design and some preliminary
progress. The study design of this
research plan closely follows the research
framework shown above. By collaborating
with local researchers in Xinjiang (north-western
China), we have already completed sample
collection and part of the data generation for
400 Uyghur individuals. More detailed
information is displayed below in Figure 11.
This big project was first proposed two
years ago; it is now an ongoing project
showing promising progress. It took us about
two years to collect all of the 1,000 Uyghur
samples (including the previous 400 samples)
and to generate genome-wide single-nucleotide
variation data for all of the samples. In addition,
we chose 90 Uyghur samples for whole-genome
deep sequencing to evaluate the influence of
population admixture on the genetic diversity
of individual genomes as well as of functional
genes (e.g. analysis of genetic load).


[Figure 11 panel labels: peripheral blood, 400 samples (DNA + RNA): blood biochemical profile (10 indices),
DNA genotyping (Illumina Zhonghua), DNA sequencing (HiSeq X10), RNA sequencing (HiSeq2000);
microbiome, oral (saliva) and gastro-intestinal (stool), 400 samples (DNA): 16S ribosomal RNA (MiSeq),
metagenome (HiSeq2000); facial morphology, 3D images, 400 samples (Dr. Kun Tang); skin characteristics,
dermatological indices, 400 samples (Dr. Sijia Wang); skin color, 400 samples (Dr. Yajun Yang);
epidemiological data (diet, smoking, alcohol...); body mass index, 400 samples.]


Figure 11. Molecular and phenotypic data collected from Xinjiang’s Uyghur samples.


While preparing for this big project over the
last two years, we also devoted considerable
research effort to another ethnic group living
in a region close to Xinjiang’s Uyghurs, i.e.
Tibetan highlanders, as presented in the
progress above. An outstanding finding from
the studies of Tibetan highlanders was that we
identified a substantial genetic contribution
from very ancient or archaic human
groups/species to present-day populations.
Our preliminary analysis of the Uyghur data
gave us a similar impression. Therefore, in the
next two years, we will not only continue
analysing admixture in the context of modern
humans, but also explore the effects of those
archaic DNA segments in a broader context,
with ancient/extinct human DNA sequences
integrated into the analysis. To pursue this aim,
we have already developed methods for
detecting archaic human DNA sequences in
modern human genomes; for example, we
developed ArchaicSeeker (Figure 12), which
has already been applied to studies on
Tibetan highlanders.


Expectation. This ongoing project is
expected to provide us not only with valuable
data for genotype-phenotype association
studies, but also with a good model for studying
how genetic admixture drives dynamic changes
in genetic diversity and human complex traits.
The methods developed and the pipeline
established in this project are expected to
facilitate future studies on other types of
evolutionary driving forces and could be
applied to other complex traits/diseases.
[Figure 12 panel labels: “ArchaicSeeker: Search for archaic lineages in modern humans”; African; Non-African;
Lu et al., AJHG 2016.]


Figure 12. ArchaicSeeker is a more heuristic method to detect archaic DNA sequences in
present-day human genomes. The principle of this method is based on the high divergence between
modern humans and archaic hominins, and the finding that archaic hominin introgressions are absent
in sub-Saharan Africans. For more details, refer to Lu et al., 2016.
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Oct 08-12, 2014. 


+ Fellow, 5th Exploratory Round Table Conference on “Personalized
Medicine: From Risk Factors to Disease Predispositions”, Shanghai, China,
Jun 14-16, 2014.


+ The Population Genomic Landscape of Human Genetic Structure, Admixture
History and Local Adaptation in Peninsular Malaysia, Human Genome (HUGO)
Meeting 2014, Geneva, Switzerland, Apr 27-30, 2014.


+ Plenary Speaker, Population Genomics of Human Migration and Local
Adaptation, Workshop on Human Whole Genome Re-sequencing Data Analysis,
Sabah, Malaysia, Mar 10-14, 2014.
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Shuai Li (Until June 2017) 


Research Overview


Skin is the largest organ of the human body and
its first line of defence against the environment.
Adaptation to the local environment through
natural selection has produced substantial skin
variation within and between human
populations. Like other complex traits, skin
variations mostly result from a combination
of intrinsic and extrinsic causes, but the
underlying genetic and environmental factors
are largely undetermined. Our research aims to
reveal these underlying factors by developing
and integrating computational approaches,
and by leveraging available genomic resources
from large-scale cohorts. Specifically, we have
three research foci:


1. Developing and integrating genomics
approaches to identify causal variants
underlying normal skin variations.


2. Modelling and evaluating the impact of
gene-environment interaction on skin aging.


3. Investigating the adaptation of skin-related
traits in the context of human evolution.


Background 


Normal skin variations are generally
understudied. For decades, skin research
has focused on clinically relevant skin
diseases. Skin color is the only well-studied
normal skin variation, probably because
it is the most conspicuous trait and varies
significantly between and within populations.
Skin color also has high heritability, ranging
from 0.5 to 0.8 in reported studies, and can be
quantified by reflectance spectrophotometry.
These factors make skin color an ideal subject
to explore in a genome-wide scan framework.
In our research, we propose to focus on other
understudied normal skin variations that
also show significant ethnic differences, high
heritability, and functional implications.


A genomic scan in a large-scale cohort can
be an effective approach. Facilitated by
continuously decreasing costs, genome-wide
association studies (GWAS) have been widely
used to discover genetic variants underlying
diseases or physical traits. The success of a
GWAS depends on many factors, including the
choice of phenotype, choice of population,
density of the scan, sample size, etc. The use of
well-established cohorts for GWAS can greatly
reduce the cost, as many phenotypes can
be collected at the same time together with
environmental data and DNA samples. Such
collection is extremely time-consuming and
logistically challenging, and would not be
feasible in a normal field collection setting.
In our research, we collect phenotype and
genotype data as well as environmental data
from the Taizhou longitudinal cohort in China.
We further collect data from the TwinsUK
cohort in the UK, the Rotterdam Study in the
Netherlands, and the CANDELA cohort in Latin
America through collaborations.


Most skin phenotypes are affected
by both genetic and environmental factors.
Skin aging related phenotypes, such as
wrinkles and pigmented spots, for example,
have been linked to chronic exposure to solar
radiation and cigarette smoke and, most
recently, to industrial and traffic-related
airborne particles in a Caucasian population.
However, skin aging varies strikingly among
individuals and among ethnic populations.
The biological mechanism behind extrinsic
skin aging could be complicated and could
involve interactions between genes and
environments. Therefore, it is extremely
important to explore both genetic and
environmental factors, as well as the
interaction between them, when studying
skin phenotypes.


The genetics of normal skin characteristics
usually have strong evolutionary
significance. Having originated in
Africa more than 100 thousand years ago,
modern humans occupied different
environmental niches all over the world
in a relatively short period of time. Local
adaptation has shaped us into an
extremely diverse species with a great deal
of phenotypic variation. Skin, as the first
point of contact with the environment, has
experienced extensive adaptive selection over
the past tens of thousands of years. The
lightened skin color in Caucasians, for example,
has been widely seen as an adaptation to
compensate for vitamin D deficiency, and was
caused independently by a number of
genetic variants. Many other genes affecting
normal skin characteristics are also subject to
strong selective adaptation, but the
mechanisms and the evolutionary
significance remain to be investigated.


Current State of Research 


1. Genes underlying normal skin characteristics


By sampling ~3000 individuals in the Taizhou
longitudinal cohort, we collected phenotypic
data, together with information on health
status, lifestyle and environmental exposure.
We took 2D and 3D photos for image analysis,
collected hair and blood samples, and
obtained genome-wide data (~890K SNPs). We
performed strict quality control on the data
by consistently using double entry and double
scoring. We built various regression
models to infer the underlying factors
associated with the phenotype of interest. We
developed a pipeline for the GWAS, including
quality control, association, presentation,
prioritization, annotation, and evolutionary and
functional interpretation. Below is a summary
of the major findings so far.


1.1. Genetic variants at 9q34.3 are associated
with skin barrier functions. Skin barrier
function, measured by transepidermal water
loss (TEWL), is affected by a number of
environmental factors and by genetics. We
performed the first genome-wide scan for genes
associated with basal TEWL in 977 Han Chinese.
Our study identified a significant signal
at 9q34.3 (p=8.16x10""), close to the Ficolin
1 (FCN1) gene (Figure 1; Zhang et al., 2017).
Interestingly, the allele frequency of the signal
SNP fits well with the fact that Africans have
better skin barrier function than Europeans
and Asians.
Figure 1. GWAS of TEWL found significant association with chromosome 9q34.3. a) The diagram of TEWL, cited from Kao
Corporation (http://www.kao.com/). b) The effect of the top SNP at 9q34.3 on TEWL. The G allele is associated with a lower TEWL score
on the face, meaning better skin barrier function. c) Manhattan plot showing the results of the meta-analysis of 611 Taizhou samples and
366 Taixing samples after adjusting for gender, age, use of skincare, temperature and humidity.
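The report does not say how the Taizhou and Taixing results were combined. One common choice for a two-cohort GWAS is a fixed-effect, inverse-variance weighted meta-analysis, sketched below for a single SNP as an assumption rather than as the authors' procedure. The effect sizes and standard errors are hypothetical.

```python
import math


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect meta-analysis of per-cohort effect estimates.

    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p


# Hypothetical per-cohort estimates for one SNP (not values from the study)
pooled_beta, pooled_se, z_score, p_value = inverse_variance_meta(
    betas=[-0.5, -0.5], standard_errors=[0.1, 0.1]
)
print(pooled_beta, round(pooled_se, 4), p_value < 5e-8)
```

The comparison with 5e-8 uses the conventional genome-wide significance threshold; each cohort alone has a larger standard error than the pooled estimate, which is why combining them can bring a signal over that threshold.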


1.2. Genetic variants at 3q26.2 and 18q23
are associated with fingerprint patterns.
The fingerprint pattern has a very high
heritability estimate (h²=0.65-0.96),
suggesting a genetic basis. A genome-wide
scan on fingerprint patterns in 2,907 Han
Chinese found significant signals at 3q26.2
(p=5.04x107!), near EVI1 (Ecotropic virus
integration site 1), which is critical in the
development of forelimbs and fingers in
humans, and at 18q23 (p=1.98x10"7), near SALL3
(Spalt-Like Transcription Factor 3), a key
regulator of limb development at early stages.
Interestingly, the 3q26.2 signal is associated
with digits 2, 3 and 4, while the 18q23 signal is
associated with digit 5, suggesting that the
fingerprint patterns of different fingers could
be affected by different genes. We further
found evidence that the signal SNPs at 3q26.2
showed enhancer features, and that the Evi1
transgenic mice (Junbo) showed phenotypic
changes in caterpillar pads, a potential
equivalent to human fingerprint patterns
(Figure 2).


Figure 2. Variants associated with fingerprint patterns with functional validations. a) Manhattan plot showing
significant signals at 3q26.2 and 18q23 associated with various fingerprint patterns. Note that the fingerprint patterns of the ten fingers
are not independent of each other and the “block structure” is also consistent with the GWAS results. b) Variants at 3q26.2
showed enhancer features that affect the transcriptional activity of the EVI1 gene, supported by a luciferase reporter assay in 293T cells
(a similar pattern was found in Kasumi-1 cells). c) The Evi1 transgenic mice (Junbo) showed phenotypic changes in caterpillar pads, a
potential equivalent to human fingerprint patterns. A significantly increased number of caterpillar pads occurred in wild types
compared with heterozygotes for D2/D3/D4 in mice.


1.3. Regulatory variants influence eyebrow
thickness. While humans have lost most
of their facial hair, the eyebrow is one of the
exceptions. Few genomic studies have
been conducted to explore its genetic basis.
Here we identified three new genome-wide
significant loci at 3q26.33 near SOX2
(p=1.11x10""), 5q13.2 near FOXD1 (p=2.52x10^-13)
and 2q12.3 (p=5.81x10"') near EDAR. All
three signals are located in potential enhancer
regions of the respective nearby genes. Further
functional genetic studies showed
that the SNP rs1345417 indeed affects the
transcriptional activity of the nearby SOX2, a
gene involved in hair development (Figure
3). We also found that none of the associated
variants showed clear selection signals in any
of the populations tested, suggesting that,
contrary to popular speculation, eyebrow
thickness may not be subject to strong
selection pressure.
Figure 3. Variants associated with eyebrow thickness with functional validations. a) Manhattan plots illustrating
results of the transethnic meta-analysis in 2,961 Han Chinese, 721 Uyghur and 2,301 Latin American samples. The top 4 PCs,
gender and age were used as covariates. b) Expression analysis of different luciferase reporters in human A375 cells. c) qRT-PCR
analysis of relative SOX2 expression in the CRISPR-Cas9 edited A375 cell clones, indicating that the SNP rs1345417 indeed
affects the transcriptional activity of the nearby SOX2.
1.4. A novel variant at OCA2 associated
with eye color in East Asians. Previous
studies have identified genetic variants in
several genes associated with human eye
color in Europeans, but there has been no
report of eye color related variants in East
Asians. We performed a GWAS on eye color in
2,938 Han Chinese. We found two
non-synonymous SNPs in OCA2 independently
associated with eye color (rs1800414: p=5.34x10 8,
rs74653330: p=3.39x10»). The findings
were replicated in a second GWAS on 690
Uyghur samples. Interestingly, we discovered
a strong interaction between rs74653330
and rs12913832, a previously reported
pigmentation related variant affecting the
expression of OCA2. We also demonstrated
that, despite the subtle differences in eye
color among East Asians, a multinomial logistic
regression model could also predict eye color
in Uyghurs (AUC=0.77 for dark brown, 0.72
for light brown and 0.92 for green surround
brown).
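For readers unfamiliar with the per-category AUC values quoted above, the sketch below shows how a one-vs-rest AUC can be computed from predicted class probabilities: it is the probability that a randomly chosen individual of the category receives a higher score than a randomly chosen individual outside it. The function name and the toy scores are hypothetical and unrelated to the study data.

```python
def one_vs_rest_auc(scores_in_class, scores_out_of_class):
    """AUC as the fraction of (in, out) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    correct = 0.0
    for s_in in scores_in_class:
        for s_out in scores_out_of_class:
            if s_in > s_out:
                correct += 1.0
            elif s_in == s_out:
                correct += 0.5
    return correct / (len(scores_in_class) * len(scores_out_of_class))


# Hypothetical predicted probabilities of one eye-color category
in_class = [0.9, 0.8, 0.4]       # individuals who have that category
out_of_class = [0.7, 0.3, 0.2]   # individuals who do not
auc_value = one_vs_rest_auc(in_class, out_of_class)
print(round(auc_value, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported values between 0.72 and 0.92 indicate moderate to good discrimination.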
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Figure 4. Variants at OCA2 associated with Eye color. a) Martin-Schultz eye color disk with 20 sub-categories. b) 
Manhattan and quantile-quantile plots illustrating the results of the GWAS in 2,926 Han Chinese samples adjusted for 
rs1800414, top two PCs, gender and age. c) Isofrequency maps of rs74653330. The contour map illustrating the geospatial 
distribution of rs74653330-T allele across the world. d) The interaction between rs12913832 and rs74653330 in Uyghurs. e) 


The prediction results of eye color in Uyghurs. 


2. Gene, environment and their interaction
affecting skin aging


The biological mechanism behind
skin aging could be complicated and could
involve interactions between genes and
environments. In order to obtain a clear
picture of the genetic and environmental
impact on skin aging in Chinese populations,
we have specifically collected: 1) skin aging
scores (SCINEXA); 2) lifestyle factors; 3) air
pollution exposure estimates (inferred from
readings of nearby monitoring stations); and
4) blood and serum samples. Additionally,
we have collected indoor air pollution
measurements (PM2.5, SO2, NO2, O3, and VOC)
in 36 households. Below is a summary of
the major findings so far.


2.1. Genetic variants at 5p15.33 are
associated with skin aging signs (pigmented
spots on hands/arms). Skin aging signs
are usually correlated with each other, and
there could be universal factors underlying
correlated aging signs. Using a partial least
squares path model, we found that the latent
variable related to pigmented spots on hands/
arms is strongly associated with variants at
5p15.33 (p=1.51x10~), where the Telomerase
Reverse Transcriptase (TERT) gene is located
(Figure 5). This study was performed in 2,964
Han Chinese samples and was replicated in a
Caucasian population in Germany, suggesting
a universal mechanism across ethnic groups.

Figure 5. Variants at 5p15.33 associated with the latent variable underlying pigment spots on hands
and arms. The latent variable (LV) was extracted from three single phenotypes: pigment spots on
arm (S1), pigment spots on back of hands (S2), and uneven pigmentation on forearm (S3). Sex and age were
used as covariates.


2.2. Air pollution exposures are associated
with manifestation of skin aging. A recent
study in German populations found that
particulate matter (PM) exposure is associated
with skin aging, particularly wrinkle related
signs. However, no such studies had yet been
done in China, where the air pollution problem
is much more severe. In two independent
Chinese populations (Pingding and Taizhou),
we found that the use of fossil fuel accelerates
skin aging, particularly wrinkle related signs
(Figure 6a; Li et al., 2015). We further found
that indoor PM2.5 is directly associated with
wrinkle and laxity related skin aging signs
(Figure 6b). Furthermore, we were the first to
report that gas pollutants (NO2 and SO2) are
associated with skin aging signs, particularly
lentigines on cheeks, suggesting that PM and gas
might affect skin aging through different
mechanisms (Figure 6c; Huls et al., 2016).
Figure 6. Air pollution exposure associated with skin aging signs. a) Use of fossil fuel is associated with significantly
more severe skin aging phenotypes. Data were collected from 858 female samples in Taizhou, China. Skin aging phenotypes
are described by arithmetic means (AM) with 95% CI. * P<0.05, ** P<0.01, *** P<0.001 (Kruskal-Wallis rank sum test). b)
The skin aging manifestations are significantly more severe in samples with higher indoor PM2.5 exposure. Indoor PM2.5
was measured in 30 households to construct prediction models inferring the indoor PM2.5 exposure in 2,428 samples
(f =0.67). Age, gender, BMI, sun exposure time and education level were used as control variables. c) Association between
air pollution and relative amount of lentigines in elderly women in Taizhou, China (N=522). An increase of 10 µg/m³ in NO2
was associated with 24% more lentigines on cheeks (P<0.001) and an increase of 10 µg/m³ in SO2 was associated with 18%
more lentigines on cheeks (P<0.05). Age, BMI, smoking history, passive smoking, education level, sun exposure time and
type of fuel were used as control variables.


2.3. Genetic variants in the AHRR gene modify
the effects of smoking on skin aging. While
smoking is already an established risk factor for
skin aging, the biological mechanism by which
it affects skin aging is still open to investigation.
There are multiple hypotheses, and it is
therefore interesting to see whether genetic
variants in the candidate genes could modify
the effects of smoking on skin aging. Here
we found that a haplotype of 5 SNPs in the
aryl-hydrocarbon receptor repressor (AHRR)
gene significantly changes the effects of
smoking on nasolabial folds (Figure 7).


[Figure 7 plot labels: nasolabial folds; haplotype CTTGA; smoking pack-years.]


Figure 7. Variants modify the effects of smoking on skin
aging. Five SNPs at AHRR form a haplotype (CTTGA)
in strong LD, offsetting the effects of smoking on the
development of nasolabial folds (P=1.0x107). Smoking effects
were measured by smoking pack-years. The grey areas are the
95% confidence intervals.
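The kind of effect modification described for the AHRR haplotype can be illustrated with a small sketch. With a binary modifier (carrier or non-carrier), the interaction coefficient of a linear model with a smoking-by-haplotype term equals the difference between the smoking slopes fitted separately in the two groups. The function names and the noise-free toy data below are hypothetical, and covariates that a real analysis would adjust for are ignored.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def interaction_effect(pack_years, fold_scores, carries_haplotype):
    """Smoking slope in carriers minus smoking slope in non-carriers."""
    carriers = ([], [])
    non_carriers = ([], [])
    for x, y, is_carrier in zip(pack_years, fold_scores, carries_haplotype):
        group = carriers if is_carrier else non_carriers
        group[0].append(x)
        group[1].append(y)
    return ols_slope(*carriers) - ols_slope(*non_carriers)


# Hypothetical toy data: smoking raises the score by 0.10 per pack-year in
# non-carriers but only by 0.02 per pack-year in haplotype carriers.
pack_years = [0, 10, 20, 30, 0, 10, 20, 30]
carries = [False] * 4 + [True] * 4
scores = [1.0 + 0.10 * x for x in pack_years[:4]] + \
         [1.0 + 0.02 * x for x in pack_years[4:]]

effect = interaction_effect(pack_years, scores, carries)
print(round(effect, 3))
```

A negative value means that the haplotype weakens the association between smoking and the aging sign, which is the direction described in the Figure 7 caption.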
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3. Adaptation of skin-related traits in 
human evolution 


Our previous findings indicated that a strongly
selected variant, EDARV370A, appeared in
present-day central China more than 30,000
years ago and swept to high frequency in a
short period of time because of its selective
advantage. We have further looked into its
phenotypic implications and pleiotropic
effects in the context of evolution.


3.1. EDAR is the predominant genetic factor
affecting hair straightness in East Asians.
Recent genome-wide scans have identified
several genes that affect hair straightness
in Caucasians and Latin Americans, but no
genome-wide scan had been performed in
East Asians. Here we found that EDARV370A
(rs3827760) plays a predominant role affecting 
hair straightness in Han Chinese (P=4.67x1076) 
and Uyghurs (P=1.92x10"?). Also, we found 
that EDARV370A had a greater effect on hair
straightness than rs11803731 (TCHH, the gene
that affects hair straightness in Caucasians)
in Uyghurs, and that there is no interaction
between them, suggesting that these two
genes affect hair straightness through
different mechanisms (Figure 8; Wu et al.,
2016). Haplotype analysis indicates that TCHH
is not subject to selection. Although EDAR is
under strong selection in East Asia, it does
not appear to be subject to selection after the
admixture in Uyghurs. These results suggest
that hair straightness is unlikely to be a trait
under selection.
Figure 8. EDARV370A plays a predominant role affecting hair straightness in East Asians. a)
Manhattan plot showing a GWAS on hair straightness in 2,961 Han Chinese. b) Bar plot showing that
rs3827760 is associated with hair straightness. With more G alleles, the proportion of straight hair increases.
c) Geographical distribution of allele frequency for the EDAR SNP (rs3827760) and the TCHH SNP (rs11803731)
based on the Human Genome Diversity Project.
3.2. EDARV370A is associated with
multiple ectodermal related phenotypes.
Pleiotropy is the phenomenon in which a gene
simultaneously affects multiple phenotypes.
As a gene that plays an important role in
ectodermal development, EDAR is a great
example of pleiotropy. We were able to
illustrate its pleiotropic effect in the Uyghur
population (Figure 9; Peng et al., 2016).
Figure 9. EDARV370A is significantly associated with hair straightness, shape of earlobe, type of chin and
incisor shoveling in a Uyghur population (N=1,027). EDARV370A is associated with A) an increased frequency
of straight hair (P=4.07x1 076); B) an increased frequency of triangular shape of earlobe (P=3.91x 10%); C) an increased
frequency of retruded type of chin (P=4.75x10*); D) an increased frequency of incisor shoveling (P=1.76x10°°).
Future Perspectives


Quantitative and high-throughput 
phenotyping through image analysis 


Accurate and efficient methods to quantify
phenotypes are vitally important for genome-wide
scans on complex traits that are associated with
genetic variants of minor effect. Because
large sample sizes are required for such
studies, automatic phenotyping is highly
desirable. Using image analysis, we have
already developed semi-automatic pipelines
for sweat gland and hair counting, and for
the detection of wrinkles and pigmented spots. We
are further developing deep-learning-based
models for the image analysis to achieve better
accuracy.


Systematic analysis including 
transcriptome and epigenome data 


Genome-wide differential expression analysis
and epigenome-wide analysis have been
effectively applied to illustrate the
variation among different skin types. On top of our
existing data (skin phenotypes, environmental
factors, and genomic data), it would be
extremely valuable to further collect
transcriptome and epigenome data in
dermatogenomics research, as these are the
critical intermediate layers connecting the
gene-environment-phenotype network. We
are ready to make this expansion in the next
two years. We are able to secure both the
skin biopsy samples (through collaboration
with dermatologists) and additional funding
to cover the cost (through collaboration with
companies). The systematic network analysis
combining different omics data will bring new
knowledge in skin research.


Explore the interplay between skin 
microbiome and skin characteristics 


The skin microbiome plays crucial roles in skin
health. While the standard practice of gut
microbiome studies is already very mature,
the standard practice of skin microbiome
studies is still in its infancy. By collaborating
with pioneering laboratories in the field
of the skin microbiome, we will try to establish
correlations between the skin microbiome
(first diversity, and then individual quantity)
and skin characteristics. We can further test
causality by carrying out intervention
experiments on the skin microbiota.


Improve prediction power of a phenotype 
by using the information of other 
associated phenotypes 


For the common traits we are studying, it
would be ideal if prediction models could be
built to infer the traits/phenotypes, as
has already been done for eye color.
However, for many common traits, it is not
practical to build a prediction model based on
genetic information alone. Since we can find
a significant amount of correlation between
phenotypes, and there are often biological
explanations behind such correlations,
we will be able to improve the prediction
models by taking advantage of the phenotype
correlations. We have already found that,
using facial information alone, we can build a
prediction model that separates people with
high blood pressure from those with normal
blood pressure (AUC=0.75). We will further explore
this area by carrying out a more systematic study
that collects a large number of phenotypes
(including image and physiology phenotypes,
as well as molecular phenotypes) at the
same time.
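As a reminder of what an AUC value of this size means, the following minimal sketch computes the area under the ROC curve as the fraction of (case, control) pairs that a score ranks correctly. The function name `roc_auc` and the toy labels and scores are hypothetical and are not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random case scores higher than a random control.

    labels: iterable of 1 (case) or 0 (control); scores: model outputs.
    Ties between a case and a control count as half a correct ranking.
    """
    cases = [s for l, s in zip(labels, scores) if l == 1]
    controls = [s for l, s in zip(labels, scores) if l == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    correct = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                correct += 1.0
            elif c == k:
                correct += 0.5
    return correct / (len(cases) * len(controls))

# Toy example (hypothetical values): 4 cases and 4 controls.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.5, 0.2, 0.1]
auc = roc_auc(labels, scores)  # 12 of 16 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a value of 0.75 indicates useful but far from perfect discrimination.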
+ Manfei Zhang*, Sijie Wu*, Juan Zhang*, Yajun
Yang*, Jingze Tan, Yu Liu, Haijuan Guan,
Kun Tang, Jean Krutmann, Shuhua Xu, Li Jin,
Yaqun Guan, Hui Li, Sijia Wang. Large-scale
genome-wide scans do not support
petaloid toenail as a Mendelian trait. J Genet
Genomics, 2016 Dec 20;43(12):702-704.


+ Manfei Zhang*, Bingjie Li*, Sijie Wu, Jingze
Tan, Yajun Yang, Alessandra Marini, Andrea
Vierkötter, Juan Zhang, Hui Li, Tamara
Schikowski, Li Jin, Jean Krutmann, Sijia
Wang. A genome-wide association study
of basal transepidermal water loss finds
variants at 9q34.3 to be associated with skin
barrier function. J Invest Dermatol, 2017
Apr;137(4):979-982.


+ Wenshan Gao*, Jingze Tan*, Anke
Hüls*, Anan Ding, Yu Liu, Mary S. Matsui,
Andrea Vierkötter, Jean Krutmann, Tamara
Schikowski, Li Jin, Sijia Wang. Genetic
variants associated with skin aging in the
Chinese Han population. J Dermatol Sci,
2017 Apr;86(1):21-29.


Cooperation 


+ Genetic analysis of facial morphology. Prof.
Li Jin, Fudan University, China.


+ The evolution and genetic mechanisms of
pigmentation related phenotypes in East
Asian and European Populations. Dr. Fan
Liu, Beijing Institute of Genomics, CAS; Dr.
Hong Shi, Kunming University of Science
and Technology, China.


+ The genetic study of craniosynostosis. Prof.
Xiongzheng Mu, Huashan Hospital, China.


+ Fingerprint patterns in Leukemia. Dr. Junhong
Song, Shanghai Children’s Medical
Center, China.


+ Study on the characteristics of physical
anthropology in essential hypertension. Dr.
Xingdong Chen, Fudan University Taizhou
Health and Sciences Institutes, China.


+ Functional validation of GWAS signals. Dr.
Lan Wang, The Institute of Health Sciences,
SIBS, CAS, China.


+ Visigen Consortium. Prof. Andres Ruiz-Linares,
University College London, UK; Prof.
Nicholas Martin, The Queensland Institute
of Medical Research, Australia; Prof. Manfred
Kayser, Erasmus University Medical Center
Rotterdam, The Netherlands; Prof. Tim
Spector, King’s College London, UK.


+ Environmental and genetic epidemiology
of aging. Prof. Jean Krutmann, Leibniz Research
Institute for Environment and Medicine,
Germany.


+ The research of genetics of facial features
and dental morphology and human skin diversity.
Prof. Andres Ruiz-Linares, University
College London, UK.


+ Gene-environment interaction on skin
aging. Prof. Tim Spector, King’s College London,
UK.


+ Functional validation for fingerprint pattern
gene. Prof. Denis Headon, University of
Edinburgh, Scotland.


External Funding 


+ Micro-evolution of the EDA pathway genes
and the impact on morphological features
in East Asians, Major Research Plan (Fostering
Project), National Science Foundation of
China (Grant No. 91331108)


+ Adaptive evolution in East Asian, Excellent
Young Scientists Fund, National Science
Foundation of China (Grant No. 31322030)


+ The evolution and genetic mechanisms of
pigmentation related phenotypes in East
Asian and European populations, Major
Research Plan (Integrated Project), National
Science Foundation of China (Grant No.
91631307)


+ Phenotype database construction and genetics
analysis research, Science and Technology
Commission of Shanghai Municipality
(Grant No. 16JC1400504)


+ Measuring physical traits in Chinese populations,
Ministry of Science and Technology
(Grant No. 2015FY111700)


+ Genetic differences in odour perception - a
phenotyping study, Unilever Company.


+ Genetics and facial ageing appearance in
China, Unilever Company.


+ The pleiotropic effect of EDAR gene in the
Uyghur population, Open project for the
National Key Laboratory of Genetic Engineering,
Fudan University.


+ The impact of air pollution on skin aging,
CAS President's International Fellowship
for Visiting Scientists, Chinese Academy of
Sciences.


+ Constructing and applying partial least
square path models in genetic analysis of
skin aging, Young Scientists Fund, National
Science Foundation of China (Grant No.
31401061)


+ “Multiple-gene-multiple-phenotype”
genetic analysis method construction
and application in skin aging study, Junior
Scientists Program of Shanghai Institutes
for Biological Sciences, Chinese Academy of
Sciences (Grant No. 2014KIP214)


Teaching (2014-2017)


+ Genetics and Development, course for
1st year graduate students at Fudan University,
2014-2016


+ Human Evolution Genetics, course for
1st year graduate students at Fudan University,
2014-2016


Invited Talks (2014-2017) 


+ Shanghai Foreign Language School, Shanghai,
China, May 2014.


+ MPS Evidence Identification Center, Beijing,
China, August 2014.


+ Unilever Workshop, Shanghai, China, November
2014.


+ National Science Foundation Workshop,
Wuhan, China, December 2014.


+ C.C. Tan International Symposium, Shanghai,
China, April 2015.


+ Qinghai Nationalities University-Fudan
University Workshop of Statistical Genetics,
Xining, China, June 2015.


+ Unilever-SIBS Workshop, Shanghai, China,
September 2015.


+ Fragrant Hill Conference on Health Big Data
and Precision Medicine, Foshan, China,
October 2015.


+ Biomedical Big Data and Precision Medicine
Forum, Shanghai, China, November 2015.


+ Ministry of Education Key Laboratory of
Contemporary Anthropology, Fudan University,
Shanghai, China, December 2015.


+ The 2nd JID International Forum on Dermatology,
Shanghai, China, May 2016.


+ The 2nd Chinese Conference on Health
Services For the Aged Promotion, Shanghai,
China, June 2016.


+ The Global Conference of Chinese Geneticists,
Hangzhou, China, September 2016.


+ The Annual Meeting of the Chinese Anthropological
Society, Shanghai, China, November
2016.


+ Frontier in Genetics and Epigenetics, The
4th China Young Geneticist Forum, Shanghai,
China, November 2016.


+ Unilever Company, Colworth, United Kingdom,
December 2016.
4.1 Biophysics (Max-Planck-fellow-group)


Researchers: 


Prof. Dr. Klaus Gerwert 

(Head of Max-Planck Fellow Group) 
Phone: +86-21-5492 0453 
Fax:  +86-21-5492 0451 
Email: Gerwert@bph.rub.de 


Staff 

Prof. Dr. Axel Mosig 
Priv.-Doz. Dr. Jürgen Schlitter 
Dr. Till Rudack 

Dr. Sascha Krauß 


Peter Serocka (Head of PICB IT; in 
Biophysics Dept. until August 2013; 
staff member until March 2017) 


Students (Shanghai)


Hang Xiao (until May 2014)
Chen Yang (until May 2016)
Qiaoyong Zhong (until May 2014)
Liang Zhu (until May 2014)


Students (Bochum)


Matthias Massarczyk
Stefan Tennigkeit


Staff (Shanghai)


Yujie Chen
Dr. Daniel Mann
Dr. Dennis Petersen


Research Overview


The Department of Biophysics was established
with the appointment of Klaus Gerwert as
Department Director in November 2008.
The research philosophy in the department
follows a bottom-up approach, beginning
at the protein level in vitro and ending at the
level of protein interactions in vivo. This multi-level
combination will provide a detailed
understanding of interactions within protein
networks in the living cell. Alterations in these
interactions, for example due to oncogenic
mutations, can cause severe diseases like
cancer. A detailed understanding is the
prerequisite for a personalized diagnosis and a
personalized therapy. In addition, novel marker-free
vibrational imaging methods (infrared and
Raman) are being established to characterize cells
and tissue. This will provide novel non-invasive
diagnostic tools: spectral cytopathology
and spectral histopathology. Klaus Gerwert
stepped down from his position as Director in
2013, becoming head of the Max-Planck-fellow
group as PI.


The department interacts and exchanges
graduate students and postdocs with the
biomolecular simulations group at the
Biophysics department and the Bioinformatics
group of Axel Mosig at Ruhr-University
Bochum. All members of the Bochum groups,
which belong to the Max-Planck fellow group,
also spent a significant amount of time in
Shanghai, most of all Dr. Steffen Wolf, who left
the PICB in March 2016 to accept a Senior
Scientist position at the Institute of Physics at
the University of Freiburg.


Collaborations exist with the group of Stefan
Grünewald at PICB in the field of the evolution of
G-protein coupled receptors.


Axel Mosig joined PICB as a founding member
in 2005 and became a PI in July 2008,
establishing the research group for Pattern
Discovery in Biology. In April 2011, Axel Mosig
accepted an offer for a faculty position in the
Faculty of Biology and Biotechnology at Ruhr
University Bochum, Germany, while remaining an
active part of the PICB Biophysics department
by supervising PhD students and, as a
CAS visiting professor, supporting frequent
visits between Bochum and Shanghai. His
research has shifted more strongly towards
computational bioimaging, supplemented by
a collaboration with the Plant Systems Biology
group on non-coding RNA computational
biology. Since 2015, Axel Mosig has been a full
professor at the Ruhr-University Bochum.


Image analysis projects in PICB's biophysics
department complement the applied image-based
biomarker discovery work conducted in
Bochum. Here, PICB with its computationally
oriented setup facilitates the development
of conceptually new algorithmic approaches
for microscopic image analysis that can be
transferred into the applied setting in Bochum.


Peter Serocka joined the Department of 
Biophysics in October 2010. Due to his 
expertise in high-performance cluster 
computing as well as hardware-accelerated 
(GPU) data processing and visualization, he is a 
major support for research in the department. 
He furthermore is the administrator of the 
Biophysics HPC cluster as part of the PICB 
main computing cluster. In close collaboration 
with Axel Mosig's group and Klaus Gerwert’s 
Department in Bochum, he develops software 
for real-time visualization and functional
analysis of hyper-spectral image data in medical
research and cell biology, as produced by
emerging technologies like FT-IR and Raman 
spectral microscopy. His major contribution in 
this field is the “Lasagne” software package. 


The Department has actively organized
international conferences to connect
the institute with the Chinese and the
international research community. The two
major events in the list of organized events
are the Cold Spring Harbor Asia conference
“GTPases: Mechanisms, Interactions and
Applications” at Suzhou in September 2014,
and the “16th European Conference on
the Spectroscopy of Biological Molecules
(ECSBM)” in September 2015. A complete list of
the internationally renowned speakers can be
found in the appendix.


Research 


Overcoming the large gap between the
detailed functional understanding of protein
structures, dynamics, and mechanisms
in structural biology, and the descriptive
understanding of protein interactions in
systems biology poses a major challenge
in modern life sciences. In the Department
of Biophysics, we work on filling this gap
through a close multidisciplinary orchestration of
molecular biology, X-ray structure analysis,
time-resolved vibrational (FTIR) spectroscopy,
and biomolecular simulations, which provides
spatio-temporal resolution of protein
interactions at different scales. This approach
provides the key to bridging the existing gap
and to reaching a detailed understanding of
protein interactions close to physiological
conditions. While the experimental work is
carried out at the biophysics department
in Bochum, the biomolecular simulations are
performed mainly at the PICB in Shanghai.


In this multidisciplinary framework, the
biomolecular simulations are crucial to
understand the complex experimental data
sets and to draw conclusions on the dynamic
protein mechanisms. In order to bridge the
different scales and disciplines, we actively
take part in the development of publicly
available software. We developed a novel method
called Local Mode Analysis (LMA, described
in section 1.3) to decode functional processes
from IR spectra by visualizing molecular
details through biomolecular simulations.


[Figure 1, left panel: Small GTPases (Ras, Ran, ...). GDP/GTP exchange (GEF) between Ras (inactive) and Ras (active), which passes on the signal.]

Furthermore, we developed the publicly
available Lasagne software. Lasagne is a
software package for interactively exploring multispectral
microscopic images. For infrared microscopic
images of tissue samples, it can be used to
identify spectra that are characteristic
of different diagnostically relevant tissue
components. As further detailed below,
there are three different major projects under
investigation.


[Figure 1, right panel: Heterotrimeric GTPases (Gαi1). GDP/GTP cycle between Gα (inactive) and Gα (active), with GAP; the active state passes on the signal.]


Figure 1. Cell signaling. External signals within crucial cellular processes like cell division are mediated via small and
heterotrimeric GTPases. See Mann, Teuber, Tennigkeit, Schröter, Gerwert, Kötting PNAS, 2016 and Rudack, T., Fei X., Schlitter, J.,
Kötting, C. and Gerwert, K, PNAS 2012 for details.


Project 1: QM/MM simulation of G 
protein interactions 


The Ras superfamily of small GTPases is
responsible for the control of many important
cellular processes, most prominently cell
division. Heterotrimeric GTPases maintain
physiological processes like vision, scent or
blood pressure regulation. These proteins
are switched into an active signaling state
(“on” state) by GDP to GTP exchange, and
return to an inactive “off” state by a GTPase
reaction (Figure 1). The defined control of
GTP hydrolysis is of vital importance for
the regulation of several signal transduction
processes in the living cell. Of high importance
for medicinal chemistry is the major role
of the small GTPase Ras in the cell growth-signaling
pathway: site-specific oncogenic
mutations inhibit the GTPase reaction, which
causes the loss of down regulation of the
downstream growth signal. This contributes
to uncontrolled cell growth, and finally results
in the development of tumors. Common to all
members of the Ras superfamily is the catalysis
of triphosphate hydrolysis by the G-domain:
Compared to hydrolysis of GTP in water, the
hydrolysis in Ras is accelerated by five orders
of magnitude from 200 days to 30 min under
physiological conditions. However, to control
growth signals in the living cell, GTPase-activating
proteins (GAPs) accelerate catalysis
by a further five orders of magnitude from 30
minutes to 50 ms.


One of our major research topics is to
elucidate in detail how this remarkable
catalysis is obtained. A detailed understanding
of structure, dynamics and function of
Ras proteins and their oncogenic mutants
enables the targeted development of drugs
that attack dysfunctional proteins in order to
cure diseases. Within the report period we
decoded how the Ras protein enzymatically
catalyzes GTP hydrolysis: We unraveled
the structures of the important steps of the
catalysis reaction mechanism with sub-Å
resolution (Figure 1).


The enzyme catalysis is mediated by sub-Å
structural changes of the substrate, induced
by the protein environment (see Mann, Teuber,
Tennigkeit, Schröter, Gerwert, Kötting PNAS,
2016). We have developed and established
a method to decode the three dimensional
structural information and structural details of
spectroscopic data by a combination of FTIR
measurements and biomolecular simulations
(see Mann, Höweler, Kötting, Gerwert, Biophys.
J. 112, 2017). Time-resolved FTIR difference
spectroscopy is highly sensitive to
changes in structure and charge distribution.
In combination with biomolecular simulation,
we can generate validated structural models
with an accuracy of up to 0.01 Å. These
structures are experimentally validated by the
comparison of the calculated spectral features
like vibrational modes, isotopic shifts, and
difference spectra with experimental ones.
Only this high resolution allows for a complete
understanding of catalysis.
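The size of each acceleration step can be checked with a short calculation from the reaction times quoted above. This is only an arithmetic sketch: the quoted times are rounded, so the ratios land near, rather than exactly on, the stated orders of magnitude. The helper name `orders_of_magnitude` is introduced here for illustration.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

# Reaction times as quoted in the text (rounded values).
t_water_s = 200 * SECONDS_PER_DAY  # GTP hydrolysis in water: 200 days
t_ras_s = 30 * 60                  # intrinsic hydrolysis in Ras: 30 min
t_gap_s = 0.050                    # GAP-catalyzed hydrolysis: 50 ms

def orders_of_magnitude(slow_s, fast_s):
    """Base-10 logarithm of the speed-up between two reaction times."""
    return math.log10(slow_s / fast_s)

water_to_ras = orders_of_magnitude(t_water_s, t_ras_s)   # about 4.0
ras_to_gap = orders_of_magnitude(t_ras_s, t_gap_s)       # about 4.6
water_to_gap = orders_of_magnitude(t_water_s, t_gap_s)   # about 8.5
```

With these rounded times, each step corresponds to a speed-up of roughly four to five orders of magnitude.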
Figure 2. Catalysis of GTP hydrolysis by small GTPases at atomic detail by integration of X-ray crystallography,
experimental, and theoretical IR spectroscopy. We showed that the red shift of the asymmetric α-phosphate vibration
between Ras and Ran is not, as previously believed, due to a different Mg²⁺ coordination but reflects a hydrogen bond
between Thr-25 and the α-phosphate oxygens in Ran that does not exist in Ras, as proven by the agreement of theoretical and
experimental IR spectroscopy studies. Therefore, the Mg²⁺ coordination cannot be the reason for the slower GTP hydrolysis
rate of Ran compared to Ras. See Rudack, Jenrich, Brucker, Vetter, Gerwert, Kötting, JBC 2015 for details.


1.1 Catalysis of GTP Hydrolysis by
small GTPases


Small GTPases regulate key processes in cells.
Malfunction of their GTPase reaction by mutations
is involved in severe diseases. In the previous report
period we performed intensive investigations of
the small GTPase Ras.
In the present report period we compared
the GTPase reaction of the slower hydrolyzing
GTPase Ran with our previous results for Ras. By
combination of time-resolved FTIR difference
spectroscopy and QM/MM simulations we
elucidated that the Mg²⁺ coordination by the
phosphate groups, which varies largely among
the X-ray structures, is the same for Ran and
Ras (Figure 2). This finding was confirmed by
our collaborator at the MPI, who obtained a
new X-ray structure of a Ran·RanBD1 complex
with improved resolution (PDB-ID 5CLL). The
Mg²⁺ coordination is not responsible for the
much slower GTPase reaction of Ran. Instead,
the location of the Tyr-39 side chain of Ran
between the γ-phosphate and Gln-69 prevents
the optimal positioning of the attacking
water molecule by Gln-69 relative to the
γ-phosphate. The steric hindrance through Tyr-39
is confirmed in the RanY39A·RanBD1 crystal
structure. The QM/MM simulations provide
IR spectra of the catalytic center, which agree
very well with the experimental ones.


The combination of both methods can 
correlate spectra with structure at atomic 
detail. For example the FTIR difference spectra 
of RasA18T and RanT25A mutants show that 
spectral differences are mainly due to the 
hydrogen bond of Thr-25 to the y-phosphate 
in Ran. This work is a proof of principle for
the high sensitivity of the integration of
theoretical and experimental IR spectroscopy.
Here, we predicted IR spectroscopic
shifts of mutations, which afterwards were
verified by FTIR spectroscopic measurements.
The difference of one hydrogen bond was
resolved by both theory and experiment. The
integration of X-ray structure analysis, QM/MM
simulations, and IR spectroscopy is a powerful
tool to analyze the structure and dynamics of
the catalytic center of a protein with atomic
resolution. Only such atomic resolution
provides the basis to understand protein
catalysis.


1.2 Catalysis of GTP Hydrolysis by 
heterotrimeric GTPases 


In order to find general motifs and
differences within the catalytic hydrolysis
mechanism of GTPases, we broadened
the set of proteins we investigate from the Ras
superfamily to heterotrimeric G proteins.
By combination of FTIR spectroscopy and
biomolecular simulations we have contributed
to the understanding of the role of active site
elements in the intrinsic and RGS4 catalyzed
GTP hydrolysis in heterotrimeric G-Proteins
(Figure 3).


Figure 3. Catalysis of GTP Hydrolysis by heterotrimeric
GTPases. In contrast to small GTPases, the substrate
GTP is incorporated by two domains in the Gα subunit
of heterotrimeric GTPases (left). The additional domain
contributes e.g. the intrinsic arginine finger, which enables
fast intrinsic hydrolysis. We elucidated nucleotide exchange
and hydrolysis kinetics of the inhibitory Gα subunit at high
spatio-temporal resolution (See Schröter, Mann, Kötting,
Gerwert J. Biol. Chem. 290(28), 2015 for details). Validation of
the proposed computational models was achieved via QM/MM
calculations of the active center of Gαi1 (right). Several
mutational studies were performed in vitro and in silico and
were found to be in good agreement (See Mann, Höweler, Kötting,
Gerwert, Biophys. J. 112, 2017 for details).


After elucidating the separated kinetics for
the on-reaction (GDP to GTP exchange) and
the off-reaction (GTP hydrolysis) using an
orchestration of fluorescence spectroscopy,
infrared spectroscopy, and simulations, and
assigning the individual educt and product
infrared bands of GTP, GDP, and Pi in Gαi1,
we investigated single point mutations in
the active site. The measured FTIR spectra
of these mutants contain information on the
corresponding active site residues with high
spatio-temporal resolution. However, in order
to decode these spectra, coupled QM/MM
calculations have to be performed. Therefore
we developed a fast and robust workflow that
was extensively validated by point mutations
that affected the infrared bands of each
single GTP group and by ¹⁸O isotopic labelling.
Measured and calculated spectra were found to be
in good agreement, which enabled application
of this workflow to investigating the role
of active site amino acids in heterotrimeric
G-Proteins. In particular, the roles of a conserved
lysine in the P-loop of GTPases and ATPases
and of the intrinsic arginine “finger” in Gαi1 were
elucidated. Both highly conserved amino acids
maintain charge transfer from γ-GTP toward
β-GTP and thereby facilitate a product-like
charge distribution that significantly catalyzes
GTP hydrolysis. Addition of the GAP RGS4
further catalyzed GTP hydrolysis by altering
the position of the intrinsic arginine finger
from a monodentate γ-GTP coordination
toward a bidentate α-γ-GTP coordination. This
movement was accompanied by increased
planarity of the γ-GTP group and a twist of
α-GTP into an eclipsed geometry, similar
to the Ras-GAP system. These geometrical
alterations create a strain in the substrate and
further accelerate GTP hydrolysis (Figure 4).


[Figure 4 panel label: Gαi1 + RGS4]


Figure 4. Mechanism of the intrinsic arginine finger in heterotrimeric G-Proteins. Addition of the
GAP RGS4 pushes the arginine finger from a monodentate γ-GTP coordination toward a bidentate α-γ-GTP
coordination. This movement is accompanied by charge transfer toward β-GTP, increased planarity of γ-GTP
and a twist of α-GTP toward an eclipsed GTP geometry. See Mann, Teuber, Tennigkeit, Schröter, Gerwert,
Kötting PNAS, 2016 for details.
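Isotopic labelling such as the ¹⁸O labelling mentioned above helps band assignment because a heavier isotope lowers the vibrational frequency. The following back-of-the-envelope sketch uses the harmonic diatomic approximation, in which the wavenumber scales with the inverse square root of the reduced mass. It is not the QM/MM workflow of the report: treating a phosphate stretch as an isolated P-O oscillator and the example band position of 1200 cm⁻¹ are simplifying assumptions made here for illustration.

```python
import math

# Isotope masses in atomic mass units.
MASS_U = {"P31": 30.973762, "O16": 15.994915, "O18": 17.999160}

def reduced_mass(m1, m2):
    """Reduced mass of a two-body oscillator."""
    return m1 * m2 / (m1 + m2)

def isotope_shifted_wavenumber(wavenumber, mu_light, mu_heavy):
    """Harmonic approximation: wavenumber scales with 1/sqrt(reduced mass)."""
    return wavenumber * math.sqrt(mu_light / mu_heavy)

mu_16 = reduced_mass(MASS_U["P31"], MASS_U["O16"])
mu_18 = reduced_mass(MASS_U["P31"], MASS_U["O18"])

band_16 = 1200.0  # hypothetical band position in cm^-1 (assumption)
band_18 = isotope_shifted_wavenumber(band_16, mu_16, mu_18)
shift = band_16 - band_18  # downshift of roughly 45 cm^-1 in this approximation
```

In a real protein the vibrations are coupled, so observed shifts are usually smaller than this isolated-oscillator estimate, which is one reason full spectral calculations are needed for an assignment.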
1.3 Current projects: Elucidate
the role of the nucleophilically
attacking water during
hydrolysis in small GTPases and
heterotrimeric GTPases


Small GTPases 


The long-term goal of the research field of
small GTPases is to understand the effect of
oncogenic mutations. Despite detailed
structural insights into the ground state of the wild
type and mutants, the function of oncogenic
mutations remains elusive. It is believed
that the mutations affect the transition states
or the positioning of the nucleophilically
attacking waters. These very rare and
transient structures are hard to obtain
experimentally and within biomolecular
simulations. Therefore, we developed a novel
method called Local Mode Analysis (LMA) for
calculating IR spectra and assigning spectral
IR bands on the basis of movements of nuclei
and partial charges from just a single QM/MM
trajectory (Figure 5). Through LMA the
decoding of IR spectra no longer requires
several simulations or optimizations. The novel
approach correlates the motions of atoms in
a single simulation with the corresponding
IR bands and provides direct access to the
structural information encoded in IR spectra.


[Figure 5 panels: "visualization in spectra" and "visualization in structure", linking structure and spectrum; spectrum axis: wavenumber / cm⁻¹.]


Figure 5. Local Mode Analysis. The bidirectional analysis is the unique feature of LMA, connecting structural
details to spectral features and, vice versa, spectral features to molecular motions. See Massarczyk, Rudack,
Schlitter, Kühne, Kötting, Gerwert, JPC B 2017 for details.
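For readers unfamiliar with how a spectrum can be obtained from a single trajectory, the sketch below shows the generic textbook route: the power spectrum of the time derivative of a dipole signal, which by the Wiener-Khinchin theorem equals the Fourier transform of its autocorrelation. This is not the LMA method itself, whose details are given in the cited paper. The function name `spectrum_from_dipole`, the one-dimensional dipole signal, the time step and the two synthetic bands are all assumptions made for illustration.

```python
import numpy as np

C_CM_PER_S = 2.99792458e10  # speed of light in cm/s

def spectrum_from_dipole(dipole, dt_fs):
    """Return (wavenumbers in cm^-1, intensity) for a 1-D dipole time series.

    The intensity is the power spectrum of the dipole time derivative
    (arbitrary units), computed with a Hann window to reduce leakage.
    """
    dipole = np.asarray(dipole, dtype=float)
    derivative = np.gradient(dipole, dt_fs)      # removes any static offset
    windowed = derivative * np.hanning(len(derivative))
    intensity = np.abs(np.fft.rfft(windowed)) ** 2
    freq_hz = np.fft.rfftfreq(len(derivative), d=dt_fs * 1e-15)
    return freq_hz / C_CM_PER_S, intensity

# Synthetic "trajectory": two oscillations at 1150 and 1250 cm^-1 plus an offset.
dt_fs = 0.5
n_steps = 65536
t_s = np.arange(n_steps) * dt_fs * 1e-15
dipole = (3.0
          + 1.0 * np.cos(2 * np.pi * C_CM_PER_S * 1150.0 * t_s)
          + 0.5 * np.cos(2 * np.pi * C_CM_PER_S * 1250.0 * t_s))

wavenumbers, intensity = spectrum_from_dipole(dipole, dt_fs)
strongest_band = wavenumbers[np.argmax(intensity)]  # close to 1150 cm^-1
```

Such a spectrum shows where bands lie but not which atoms cause them; assigning bands to specific molecular motions is the step that, according to the text above, LMA addresses.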


For the software development part of the
project we aim to develop an easy-to-use
graphical user interface (GUI) to our theoretical
approaches for decoding structural information
from experimental IR spectra. This
GUI will provide an intuitive approach that not
only helps experts in the field of spectroscopy
to better interpret their spectral data but also
makes IR-spectroscopic results readily available
to most structural biologists and biochemists.


Heterotrimeric GTPases 


Modification or mutation of the intrinsic
arginine finger in heterotrimeric GTPases is the
origin of severe diseases like cholera, cancer
or the McCune-Albright syndrome. We and
others demonstrated that the missing arginine
finger in Gαi1 can be compensated by the
GAP RGS4. Indeed, to date no diseases that
correlate with arginine finger alterations in
Gαi1 are known. However, when the arginine
finger is missing or modified in the isoform
Gαs the aforementioned diseases occur,
presumably because to date no RGS protein
is known that matches Gαs. Our long-term
goal is therefore the design of an RGS protein
that accelerates GTP hydrolysis in Gαs and can
compensate for the missing arginine finger in
this isoform. Mutations of the Gα subunit were
already shown to enable RGS4 binding to
Gαs-D229S. Thus, mutation of the RGS protein
to enable GAP interaction should also be
possible. The interaction motif shall be
characterized via biomolecular simulations to
suggest point mutations for subsequent
experiments.


Following the same ideas as for small GTPases
we want to continue our research on active
site mutants, e.g. the catalytic glutamine that
is thought to position the nucleophilically
attacking water molecule during GTP
hydrolysis.


1.4 Medium and Long Term 
Research Goals 


In future studies we aim to 


+ unravel the role of oncogenic Ras mutants 
and other GTPases. 


+ find general motifs and differences within
the catalytic hydrolysis mechanism of
other members of the Ras superfamily and
heterotrimeric G proteins.


+ elucidate the role of the nucleophilically 
attacking water during hydrolysis. 


+ design an RGS protein that accelerates GTP
hydrolysis in Gαs.


+ clarify structure and dynamics of the
transition state of GTP hydrolysis by taking
the new structural insights about the educt
states in Ras, Ran, and Gα into account.


+ understand the interaction with effectors, 
and how and why very different pathways 
like cell growth and apoptosis can be 
activated by the same Ras protein. 


+ clarify the role of the membrane assembly 
of small GTPases, especially the Ras 
dimerization interface as drug target. 


Project 2: Reaction mechanisms 
and interactions of heptahelical 
membrane proteins 


A major challenge for current biophysical
research is the understanding of both detailed
functional mechanisms and interactions of
membrane proteins, especially of heptahelical
membrane proteins. Within Project 2 we
focus on microbial rhodopsins, specifically
bacteriorhodopsin (bR) and channelrhodopsins,
and on G-protein coupled receptors (GPCRs),
in particular olfactory receptors. Our major
interest lies in the interaction between 
proteins and small molecules, such as retinal, 
small organic ligands, and water molecules. 
Our research group has especially contributed 
to the understanding of the role of water 
molecules in protein structure stabilization 
and protein-internal proton storage and 
transduction. Our major focus for these studies 
has been the prototypical protein proton 
pump and microbial rhodopsin bR. Figure 6 
summarizes our current understanding of the 
vectorial proton transfer through bR with the 
help of protein-bound water molecules. 


Figure 6. Proton transfer via protein-bound water molecules in bacteriorhodopsin: Water molecules are involved
in the protonation of Asp85 by the retinal Schiff base (1), the release of a water-delocalized excess proton into the
extracellular bulk water (2), and the proton transfer from Asp96 to the retinal Schiff base during reprotonation. See
Gerwert et al, BBA-Bioenergetics 2013 for details.


In addition to bacteriorhodopsin, the light-
gated ion channel channelrhodopsin-2 (ChR2),
which is highly important for neurobiology,
has become the major focus of our research.
So-called “optogenetics” has become a
promising, fast-evolving field. Optogenetic
applications are advantageous because they
are less invasive than other methods and
provide a high spatial and temporal resolution
in neurologic stimulation. This revolutionary
method is based on the use of genetically
targetable photosensitive proteins to
control the activity of excitable cells by light.
Within this field, ChR2, a microbial rhodopsin
from Chlamydomonas reinhardtii, has become
a major tool. ChRs in general are members of
a specific subclass of sensory rhodopsins that
mediate phototaxis and photophobic behavior
in flagellate green algae. Despite the variety
of applications in neurobiology, little is known
about the molecular mechanisms of ion pore
formation. In this study, we aim to close this
gap by incorporating information we get from
biophysical methods such as time-resolved
FTIR, Raman and UV/Vis spectroscopy into
our Molecular Dynamics simulations.


Furthermore, we have extended our research
interest towards GPCRs, which are the largest
group of membrane receptor proteins within
the human genome. Our special interest lies
in olfactory receptors, which are the largest
subgroup of GPCRs. As it was recently shown
that these receptors are distributed all over
the human body, and that their expression is
up-regulated and linked to cell proliferation in
different types of cancers, they have become
an interesting target for cancer therapy.
Furthermore, X-ray crystallography of GPCRs
had a breakthrough in 2007, and since then,
about thirty X-ray structures of receptors have
been resolved. In an internal PICB collaboration
with Stefan Grünewald, we focus on the
combination of structural information with
bioinformatics and mathematical methods to
shed light on the evolution of GPCRs, which
so far has only been investigated by protein
sequence-based bioinformatic investigations.
Figure 7a. Analysis of the contribution of different
delocalization modes to delocalization at the bR release
site (see Wolf et al, Biophys. J. 2014). (A) Schematics of the
two minimal model QM systems resulting in a calculated
continuum band: glutamate-shared proton and protonated
water cluster. (B) Exponential fit of the continuum band
between 2300 cm⁻¹ and 1800 cm⁻¹ (orange) in comparison
with the calculated spectra for protonated water cluster
QM box (black), Glu-shared proton QM box (cyan), and the
small QM box (blue). Over this spectral range, the continuum
band exhibits the form of an exponential function. Below
1800 cm⁻¹, the experimentally observed continuum band
(red) overlaps with the carbonyl vibrations of Asp85, Asp96,
and Asp115 between 1765 cm⁻¹ and 1715 cm⁻¹. Therefore, the
experimental curve is upshifted as compared with the fit.
Between 1715 cm⁻¹ and 1705 cm⁻¹, where no other bands are
observed, the fitted curve again matches the experimental
curve. The overall shape of the exponential fit is in best
agreement with the line shape from simulations with a
protonated water cluster QM box.


2.1 Proton transduction 
via water molecules within 
bacteriorhodopsin 


Continuing our work on transient water
structures for proton storage and conduction
in bacteriorhodopsin (bR), we set out to
make a clear-cut comparison of experimental
IR spectra and theoretically predicted
intraprotein water motifs by calculating their
theoretical infrared spectra from ab initio
QM simulations within a representation of the
protein as MM model. We carried out this project
in cooperation with Qiang Cui from the
University of Wisconsin at Madison, WI, USA,
who introduced Steffen Wolf to the use of
the CHARMM program package and of
DFTB3/CHARMM QM/MM calculations.


In this subproject, we focused on calculations
of spectra of both the proton release and
the proton uptake site (see Figure 6). For
the proton release site, we found that the
experimentally observed continuum band
present between 2100 and 1700 cm⁻¹ best fits
to a simulation with the excess proton stored
at the release site being shared between
the water molecules present in this protein
domain (see Figure 7a).
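

The exponential description of the continuum band mentioned in the caption of Figure 7a
can be sketched as a log-linear least-squares fit. This is only an illustrative example
under the assumption of a strictly positive, baseline-corrected absorbance; the function
name, the default fit window and the fitting procedure are ours and are not taken from
the cited publication.

```python
import numpy as np


def fit_exponential(wavenumbers, absorbance, lo=1800.0, hi=2300.0):
    """Fit absorbance = A * exp(k * wavenumber) within [lo, hi] cm^-1.

    The fit is linear in log space, so only positive absorbance values are used.
    Returns the amplitude A and the rate constant k (in cm).
    """
    wn = np.asarray(wavenumbers, dtype=float)
    ab = np.asarray(absorbance, dtype=float)
    mask = (wn >= lo) & (wn <= hi) & (ab > 0)
    k, log_a = np.polyfit(wn[mask], np.log(ab[mask]), 1)
    return np.exp(log_a), k


def evaluate(amplitude, k, wavenumbers):
    """Evaluate the fitted exponential, e.g. to extrapolate outside the fit window."""
    return amplitude * np.exp(k * np.asarray(wavenumbers, dtype=float))
```

Evaluating the fitted curve below the fit window allows a comparison with regions in
which other bands overlap the continuum, as discussed for the carbonyl region.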


For the proton uptake site, we found that the
theoretical spectra of three water molecules
simulated in this protein domain reproduce
nicely the experimentally observed O-H
stretch vibrations (see Figure 7b). Surprisingly,
we found that motifs of different numbers
of water molecules result in characteristic
IR lineshapes, by which they can be clearly
differentiated. In our case, the lineshape
of three water molecules matched the
experimental spectra very well, while four
or five molecules did not result in the right
lineshape. This result allows for the structural
elucidation of functional water motifs within
proteins, which cannot be observed via other
methods due to their high flexibility.
Figure 7b. Spectral signatures of the water wire calculated with SCCDFTB QM/MM simulations in comparison with
experimentally obtained BR-N spectra (see Wolf et al, J. Chem. Phys. 2014). Experimental spectra are in red and calculated
spectra are in black. Frequencies were scaled by a factor of 0.962 to match calculated and measured dangling bond
vibrations. (Left) Dangling O-H bond range. The experimental spectrum shows a peak (hatched with horizontal lines)
with two maxima at 3670 cm⁻¹ and 3658 cm⁻¹, which was attributed to two water molecules with dangling O-H bonds
appearing in the water wire. In agreement with this, the simulations show one major combined absorption peak (highlighted
in grey). The negative experimental peak at 3643 cm⁻¹ is caused by a water molecule with a dangling O-H bond at the
complex counter ion of the Schiff base. (Right) Baseline-corrected strong hydrogen bond range. The experimental spectrum
shows a broad double-hump shaped continuum absorbance feature between 2775 cm⁻¹ and 2525 cm⁻¹. In agreement with
this, the calculations result in a similar broad double-hump shaped continuum absorption feature between 2700 cm⁻¹ and
2530 cm⁻¹.


2.2 Protonation changes and pore 
formation in channelrhodopsins 


In our previous publications on ChR2 (see
Eisenhauer et al, JBC 2012) we showed that
the deprotonation of the glutamate residue
E90 causes water influx into the conductive
pore. Significant improvements in our FTIR
spectroscopic setup allowed us to monitor
the time course of the E90 protonation state
with high temporal resolution. We observed
an ultrafast deprotonation of E90 no later than
2 µs after start of the reaction. This indicates
a direct coupling of E90 to chromophore
isomerization. Using an improved ChR2
WT homology model based on the crystal
structure of a ChR C1C2 chimera (PDB-ID:
3UG9; Kato et al., Nature 2012) we examined
the effect of the isomerization on the
hydrogen bonding network involving E90
and N258. In the ground state all-trans retinal
configuration E90 on Helix2 is connected to
N258 on Helix7 via two hydrogen bonds. To
model the all-trans → 13-cis isomerization of
the retinal we rotated around the C13=C14
double bond in 20° steps (for details see
Kuhne et al. Angew. Chemie Int. Ed. 2015).
We observed an isomerization-induced
displacement of N258 which leads to a
disruption of the hydrogen bonds to E90.
After that E90 flips downward and is able to
deprotonate (see Figure 8A). This mechanism
explains the ultrafast deprotonation observed
in our FTIR experiments. The MD simulations
further show that after E90 deprotonation an
outward tilt of Helix2 occurs with E90 working
as a hinge for this movement (see Figure 8B).
The observed helix movement is in agreement
with ESR measurements (Krause et al. FEBS
Letters 2013). The tilt is caused by water influx
into the protein that weakens the hydrogen
bonds of the inner gate. Thereby a continuous
aqueous pore through the protein, with E90 at
the narrowest point, is formed (see Figure 9).
The water influx has later been independently
confirmed by FTIR measurements of other
groups (Lórenz-Fonfría et al. PNAS 2015).
This process reflects a pre-formation of the
conductive pore that we termed the E90-
Helix2-tilt (E-H-T) model (Kuhne et al. Angew.
Chemie Int. Ed. 2015).
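

The stepwise rotation around the C13=C14 double bond described above can be pictured
with a purely geometric sketch. The example rotates a set of atom coordinates about the
axis defined by two bond atoms using Rodrigues' rotation formula. It is not the protocol
of the cited publication: the function names, the total rotation of 180° and the absence
of any relaxation between steps are assumptions made for illustration.

```python
import numpy as np


def rotate_about_bond(coords, atom_a, atom_b, angle_deg):
    """Rotate coords (n, 3) by angle_deg about the axis running from atom_a to atom_b."""
    coords = np.asarray(coords, dtype=float)
    atom_a = np.asarray(atom_a, dtype=float)
    axis = np.asarray(atom_b, dtype=float) - atom_a
    axis = axis / np.linalg.norm(axis)
    theta = np.deg2rad(angle_deg)
    v = coords - atom_a
    # Rodrigues' rotation formula applied row-wise
    rotated = (v * np.cos(theta)
               + np.cross(axis, v) * np.sin(theta)
               + np.outer(v @ axis, axis) * (1.0 - np.cos(theta)))
    return rotated + atom_a


def stepwise_rotation(coords, atom_a, atom_b, total_deg=180.0, step_deg=20.0):
    """Return one rotated copy of coords per step (20, 40, ... degrees by default)."""
    n_steps = int(round(total_deg / step_deg))
    return [rotate_about_bond(coords, atom_a, atom_b, step_deg * (i + 1))
            for i in range(n_steps)]
```

Here `coords` would hold the atoms on one side of the double bond (a hypothetical
selection), while `atom_a` and `atom_b` are the positions of C13 and C14.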


2.3 Ligand recognition in olfactory 
receptors 




Figure 8. All-trans to 13-cis isomerization and Helix2
movement in ChR2 based on MD simulations. (A) The side
chains are colored and the protein backbone is shown
in gray. The all-trans and the fully rotated 13-cis retinal
configurations are shown in yellow and magenta. RSBH⁺
isomerization (1) is followed by K257 and N258 displacement
(2), loss of one hydrogen bond (3), E90 outward flip (4),
and E90 deprotonation (5). (B) Movement of Helix2 as
a consequence of the events shown in (A). Upon the
deprotonation and outward flip of E90, Helix2 is tilted
outward by 3.9 Å on the intracellular side of the protein. E90
within Helix2 works as a hinge for this movement.


Figure 9. ChR2 in (A) the closed and (B) preopen state.
The isomerization-induced deprotonation of E90 leads to
a preopening of the pore. In the closed state the hydrogen
bonds between E90 and N258 (central gate) as well as E83
and R268 (inner gate) connect Helix2 with Helix7. Both
connections prevent water influx into the intracellular
vestibule. In the preopened state these connections are lost
because of the flip of E90 from the inward (IC) to the outward
(EC) orientation and water influx into the protein.
As a second branch of investigations on heptahelical proteins, we analyzed ligand
binding in G-protein coupled receptors (GPCRs). GPCRs form the largest family of
membrane receptors in the human body, and are the target for approximately 40% of
all drugs commercially available. In a close cooperation with the Department of Cell
Physiology at the Ruhr-University Bochum, we focus on research of olfactory receptors,
which form the largest subclass of GPCRs and currently emerge as a possible new target
for cancer treatment. Continuing our earlier work, in which we established a model of
matching receptor/ligand dynamics as a major requisite for olfactory receptor activation,
we focused on the olfactory receptor hOR51E2, which is a putative drug target for the
treatment of both prostate and skin cancer.


Figure 10 displays our recent findings: In this project, we were especially interested
in understanding the effect of different pharmacological classes of ligands on hOR51E2.
Experimentally, one agonist (β-ionone) and one antagonist (rac-α-ionone) are known to
exist. To our surprise, we could not identify any clear binding mode of the ligands within
a receptor homology model, and the investigated compounds exhibited a high degree of
flexibility within the protein. However, binding free energies from these highly dynamic
binding poses were in good agreement with trends observed in experiment, and mutation
analysis allowed us to verify the involvement of residues predicted in simulations in the
binding of ionones. Furthermore, receptor activation did not correlate with protein/ligand
contacts, but with ligand-mediated changes in protein-internal polar contacts. We conclude
that ionones enter the ligand binding site of hOR51E2 and remain very flexible within the
receptor, but modulate protein-internal contacts in such a way that the active structure is
stabilized. A manuscript summarizing this project is currently under review.


Figure 10. Activity tests of hOR51E2 ligands in silico and in vitro. Top: residence volume for β-ionone (agonist, left) and the
two enantiomers of α-ionone (antagonists; middle and right). No clear binding position can be observed in MD simulations.
Bottom left: experimental co-application of rac-α- and β-ionone. Both compounds exhibit approximately the same affinity to
the receptor. Middle: docking results for ionones in the inactive receptor model. All compounds exhibit approximately the same
affinity, which agrees well with the experiment. Bottom right: binding free energies calculated from MD simulations. Both α- and
β-ionones exhibit the same affinity for the inactive receptor. For the active receptor, the affinity of α-ionones is worse, while the
one of β-ionone remains the same, which is in agreement with α-ionones being antagonists, and β-ionone being an agonist.


Figure 11. Development of peptide and small molecule binding GPCRs by protein self-interaction with extracellular loop
2 (el2). Peptide ligand binding volume in rhodopsin-like GPCRs as mesh, small organic ligands in grey sticks with polar
oxygen / nitrogen atoms as spheres, el2 as yellow cartoon, N termini as red cartoon. In all known non-rhodopsin GPCR
structures (purple circle), el2 forms a β-hairpin structure, which surrounds the peptide binding volume. In rhodopsin-like
peptide receptor structures with bound peptide ligand (blue circle), el2 forms a β-hairpin, which flanks this domain, too.
This common arrangement suggests that during evolution, both rhodopsin-like and non-rhodopsin peptide binders kept
common ancestral peptide ligand binding features. Furthermore, both classes kept a common position and shape of el2. The
length of the rhodopsin el2 is in good agreement with the one observed in peptide-binding GPCRs. In rhodopsin (green circle,
top), el2 goes right through the common peptide / small molecule ligand volume, and is held there by steric constraints from
the N terminus. Retinal binds underneath and outside of this contact domain at a position close to the protein center. In the
sphingosine receptor (green circle, bottom), el2 is disordered and has contracted in comparison to rhodopsin, while still being
in contact with the N terminus. The ligand has advanced into the common ligand domain, performing contacts there. We
therefore assume that an opsin precursor forms the link between peptide-binding GPCRs and small molecule-binding GPCRs.
In small molecule-binding GPCRs (red/orange circle), el2 is disordered or helical and has retracted from the peptide-binding
domain. Small molecule ligands bind in the peptide contact volume at the position of el2 in rhodopsin, and below it. It seems
that the ancestral rhodopsin-like GPCRs initially possessed an el2 in the form of a β-hairpin. Peptide binders retained el2 in
this form. Small molecule ligands for GPCRs seem to have then developed as allosteric binders via an opsin ancestor as key
intermediate, with el2 substituting a bound peptide ligand. Along the proposed pathway, small molecule ligands then seem to
have substituted el2, which retracted from the binding site and lost the β-hairpin conformation. Purine receptors seem to have
undergone a similar development. See Wolf & Grünewald, PLOS ONE 2015 for details.


2.4 Evolution of GPCR structures


In a project together with Stefan Grünewald from PICB and Raymond Stevens from
ShanghaiTech University, we investigated the evolution of the rhodopsin-like GPCR family.
We performed the first phylogenetic analysis of GPCRs based on a set of 24 GPCR structures
available at the time. As shown in Figure 11, we defined a new phylogenetic tree of known
human rhodopsin-like GPCR sequences based on this structure set. We can distinguish the
three separate classes of small-ligand binding GPCRs, peptide binding GPCRs, and olfactory
receptors. Analyzing different structural subdomains, especially the extracellular loop
2 and the binding sites, we found that small
molecule binding receptors most likely have
evolved from peptide receptor precursors, 
with a rhodopsin/S1PR1 ancestor, most likely 
an ancestral opsin, forming the link between 
both classes. A light-activated receptor 
therefore seems to be the origin of the small 
molecule hormone receptors of the central 
nervous system. We furthermore found hints 
for a common evolutionary path of both 
ligand binding site and central sodium/water 
binding site. Surprisingly, opioid receptors 
exhibit both a binding cavity and a central 
sodium/water binding site similar to the one of 
biogenic amine receptors instead of peptide
receptors, making them seemingly prone
to bind small molecule ligands, e.g. opiates.
Our results therefore gave several surprising 
new insights into the relationship and the 
pharmacological properties of rhodopsin-like 
GPCRs. 


2.5 Current projects: pore-opening 
mechanism of Channelrhodopsin 
2, ligand binding in GPCRs 


a) Channelrhodopsin 

Following our earlier research on ChR2, we are
currently investigating additional isomerization
events which we found in experiments
at Bochum. Our recent Raman and FTIR
spectroscopic measurements show an
early splitting of the photocycle into an
all-trans, C=N-anti → 13-cis, C=N-anti and an
all-trans, C=N-anti → 13-cis, C=N-syn pathway. In
the future we plan to incorporate these findings
into our MD simulations and analyze the
response of the protein.


b) GPCRs 

Furthermore, we continue our research on 
ligand binding in olfactory receptors and will 
investigate the exact physicochemical nature 
of the ligand recognition mode in olfactory 
receptors (shape-based or vibration-based) 
with the well-established olfactory receptor 
hOR17-4, which was one of the first olfactory 
receptors found to be ectopically expressed in 
human sperm cells. 


2.6 Medium and Long Term 
Research Goals 


+ We plan to establish QM/MM based methods
to calculate UV/Vis absorbance spectra
of retinal proteins.


+ Using these methods we aim to identify
residues that allow us to shift the absorption
maximum of ChRs.


+ We will investigate the impact of protonation
changes that ultimately lead to the
open state of ChRs.


+ Finally we plan to extend our research to 
other ChRs that recently have been found. 


Project 3: Computational methods 
for spectral histopathology and 
quantitative microscopy 


Infrared and Raman microscopy make it possible to resolve detailed structures of cell
and tissue samples in a label-free manner and have thus become an increasingly popular
tool in histopathological and cytopathological studies. As the technology continues to
proliferate, our aim from a bioinformatics perspective is to develop computational
approaches and tools that facilitate or improve the application of infrared and Raman
microscopy, in particular in diagnostic applications. Bioinformatics research at PICB
is tightly integrated with the research group of Axel Mosig in Bochum, where infrared and
Raman microscopy play a central role in the clinical studies within the PURE consortium
headed by Klaus Gerwert.


3.1 Algorithms for cross-platform
microscopy


Utilizing infrared and Raman microscopy for diagnostic applications typically involves
two or even more microscopy modalities. In order to obtain training data for supervised
classifiers, the label-free infrared or Raman microscopic images are typically overlaid
with conventional light microscopic images of the same sample, for instance
involving histopathological staining through hematoxylin and eosin (H&E) or fluorescence
labeling. Computationally, this requires algorithmic approaches to align observations
between microscopic images obtained from different modalities.


In this context, we have contributed two major approaches. First, we have developed
an image registration algorithm (Chen et al., 2015) for registering infrared or Raman
microscopic images within their H&E stained counterparts. The algorithm combines several
improvements of state-of-the-art approaches in image registration to make image
registration both efficient and effective. We have developed an adjusted
version of mutual information that serves as a target function to be optimized under
geometric transformations. Furthermore, a simultaneous hierarchical decomposition in
object space and transformation space makes it possible to register infrared images within
H&E images within minutes even if gigabytes of imaging data are involved.


A second methodological contribution in the context of cross-platform microscopy is a
colocalization algorithm (Krauß et al., 2015) for the extraction of training data from Raman
microscopic imaging data. The algorithm computes hierarchical segmentations of
both the Raman microscopic image and its fluorescently or H&E stained counterpart.
Based on these hierarchical decompositions, highly correlated segments are identified that
can be used to extract training spectra from the Raman image. As shown in Figure 12, this
approach can be used to train highly reliable classifiers in cytopathological applications.


Figure 12. Bioinformatics workflow diagram for training
colocalization-based classifiers in spectral cytopathology.
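

The role of mutual information as a registration target function in Section 3.1 can be
illustrated with a minimal sketch. The code computes plain histogram-based mutual
information between two single-channel images and searches a small set of integer
shifts exhaustively. It does not reproduce the adjusted mutual information or the
hierarchical decomposition of the published algorithm; the function names, the number of
bins and the shift range are assumptions chosen for the example.

```python
import numpy as np


def mutual_information(img_a, img_b, bins=32):
    """Mutual information (in nats) of two equally sized single-channel images."""
    hist, _, _ = np.histogram2d(np.ravel(img_a), np.ravel(img_b), bins=bins)
    p_ab = hist / hist.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of img_a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of img_b, shape (1, bins)
    nonzero = p_ab > 0
    return float(np.sum(p_ab[nonzero] * np.log(p_ab[nonzero] / (p_a @ p_b)[nonzero])))


def best_shift(fixed, moving, max_shift=3, bins=32):
    """Exhaustively search integer (dy, dx) shifts of `moving` that maximize MI."""
    best_score, best = -np.inf, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            candidate = np.roll(moving, (dy, dx), axis=(0, 1))
            score = mutual_information(fixed, candidate, bins)
            if score > best_score:
                best_score, best = score, (dy, dx)
    return best, best_score
```

Mutual information is attractive for cross-modality registration because it only requires
a statistical dependence between the intensities of both images, not similar intensity
values. An exhaustive search as above becomes infeasible for large images and richer
transformations, which is where hierarchical search strategies come in.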
3.2 Spectral cytopathology


Using cellular material for diagnostic purposes
is a common approach for the diagnosis of
several types of cancer such as cervical or
bladder cancer, where sample material can
be obtained non-invasively. Using Raman
and CARS microscopy, we aim to overcome
the low sensitivity that is inherent to existing
bladder cancer cytopathological assays.


Our approach is based on using fast CARS
(coherent anti-Stokes Raman scattering)
microscopy to identify candidates for urothelial
cancer cells from the large number of cells
that is found within a typical urine sample.
These preselected cells can be assessed in
detail using a colocalization-based classifier
for the pixel spectra of a Raman microscopic
image. As shown in Figure 13, the pixel spectra
classification can be jointly analyzed in a
spatial bagging process, which yields perfect
classification at cellular level in our current
sample cohort.
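

The step from pixel-level to cell-level classification can be sketched as follows. The
aggregation rule used in the spatial bagging process is not specified here, so the
example assumes a simple majority vote over all pixel predictions belonging to one
cell; the function name and the input layout are likewise hypothetical.

```python
import numpy as np


def cell_level_labels(pixel_labels, cell_ids):
    """Aggregate per-pixel class predictions into one label per cell by majority vote.

    pixel_labels: predicted class for each pixel spectrum
    cell_ids:     identifier of the cell each pixel belongs to (same length)
    """
    pixel_labels = np.asarray(pixel_labels)
    cell_ids = np.asarray(cell_ids)
    result = {}
    for cell in np.unique(cell_ids):
        votes = pixel_labels[cell_ids == cell]
        values, counts = np.unique(votes, return_counts=True)
        result[cell.item()] = values[np.argmax(counts)].item()
    return result
```

Pooling many pixel decisions per cell makes the cell-level result robust against
individual misclassified pixel spectra.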


In a preliminary study, we employed
morphological and textural classifiers on a
small subspectrum of the complete Raman
spectrum. As it turns out, we can achieve an
accuracy exceeding 90% by utilizing only
three wavenumbers of the Raman spectrum
when computing morphological and textural
features based on a convolutional neural
network. This result indicates the potential
of CARS microscopy for cytopathology,
where spectral images involving only a few
wavenumbers of the spectrum can be imaged
at very high speed.


[Figure 13 panel labels: 32 cancer cells (8 patients); Spatial Bagging]


Figure 13. Spectral cytopathology based on the spectral classifier shown in Figure 12.
3.3 Medium and Long Term
Research Goals


+ As indicated in our preliminary work on 
Raman cytopathology, convolutional 
neural networks provide an attractive 
approach to include morphological and 
textural information from spectral images 
for classification. This approach will be 
extended to the classification of infrared 
microscopic images of tissues in the future. 
The challenge in this perspective is to 
employ convolutional neural networks 
for multispectral imaging data that avoid 
an excess of parameters being created. 
These classifiers will be particularly 
relevant for both CARS microscopy and 
quantum-cascade laser based infrared 
microscopy, as both techniques allow very 
fast measurement at a limited choice of 
wavenumbers. 


+ We have recently successfully combined
infrared microscopy of tissue samples with
automated laser microdissection based
on label-free identification of cancerous
regions. This allows us to extract small but
highly specific subregions of the tumor that
can be further molecularly characterized.
In collaboration with Li Yang's group, we
aim to perform exome analyses of several
regions within a single sample. This will
allow us to trace mutation patterns within
the tumors, and thus provides a systematic
approach to investigate intra-tumor
heterogeneity.


Project 4: The Biophysics HPC 
cluster at PICB 


The Biophysics High-Performance Computing
(HPC) cluster at PICB is the major platform
for demanding tasks in Molecular Dynamics
(Steffen Wolf et al.) and Bio-Imaging (Axel
Mosig et al.).


The Biophysics cluster consists of 60 rack- 
based Linux nodes (IBM, HP), which share the 
general infrastructure, such as network, job 
scheduling, file storage and backup, and server 
room environment, with the main PICB HPC 
cluster. The total capacity currently is 744 CPU
cores (Intel Xeon, AMD Opteron) dedicated to
the Biophysics Department, and in addition
a fair share of the 1132 cores in PICB’s general
48-core and 80-core nodes is used through
the institute-wide job scheduling. The latter
nodes, purchased in late 2011, have been 
designed to fulfill the highest computing 
demands throughout PICB, and the Biophysics 
Department's demands for memory size and 
parallel execution capabilities using shared 
memory have been serving as one high-water 
mark. 


Data collection, processing and storage are done
on PICB's central network storage (NAS) cluster
running EMC/Isilon’s OneFS operating system.
It runs highly stable and allows for long-
running computer jobs without interruption.
It furthermore is the best solution for the high
demands of data space resulting from the
different research projects in the Department:
as of February 2014, about 18 TB of trajectory
data from molecular simulations plus 2 TB of
imaging and spectral data are stored.
General Information
Publications (2015-2017) 
Publications at PICB or in collaboration with PICB: 


+ Mann D, Höweler U, Kötting C, Gerwert
K. Elucidation of Single Hydrogen Bonds in
GTPases via Experimental and Theoretical
Infrared Spectroscopy. Biophysical J., 2017, 1
(112), 66-77


+ Massarczyk M, Rudack T, Schlitter J,
Kuhne J, Kötting C, Gerwert K. Local Mode
Analysis: Decoding IR Spectra by Visualizing
Molecular Details. J Phys Chem B. 2017 Apr
20;121(15):3483-3492


+ Mann D, Teuber C, Tennigkeit SA, Schröter
G, Gerwert K, Kötting C. Mechanism of the
intrinsic arginine finger in heterotrimeric G
proteins. PNAS, 2016, 50 (113), E8041-E8050


+ Schröter G, Mann D, Kötting C, Gerwert
K. Integration of Fourier Transform Infrared
Spectroscopy, Fluorescence Spectroscopy,
Steady-state Kinetics and Molecular
Dynamics Simulations of Gαi1 Distinguishes
between the GTP Hydrolysis and GDP
Release Mechanism. J. Biol. Chem., 2015,
290(28)


+ Yang C, Niedieker D, Großerüschkamp
F, Horn M, Tannapfel A, Kallenbach-Thieltges
A, Gerwert K, Mosig A. Fully
automated registration of vibrational
microspectroscopic images in histologically
stained tissue sections. BMC Bioinformatics,
2015, 16, 396-409


+ Rudack T, Jenrich S, Brucker S, Vetter IR,
Gerwert K, Kötting C. Catalysis of GTP
hydrolysis by small GTPases at atomic detail
by integration of X-ray crystallography,
experimental and theoretical IR
spectroscopy. J. Biol. Chem., 2015, 40 (290),
24079-24090


+ Kuhne J, Eisenhauer K, Ritter E,
Hegemann P, Gerwert K, Bartl F. Early
Formation of the Ion-Conducting Pore in
Channelrhodopsin-2. Angew. Chem. Int. Ed.,
2015, 54, 4953-4957


+ Wolf S, Grünewald S. Sequence,
Structure and Ligand Binding Evolution
of Rhodopsin-Like G Protein-Coupled
Receptors: A Crystal Structure-Based
Phylogenetic Analysis. PLOS ONE, 2015, 10(4),
e0123533


+ Wolf S, Freier E, Cui Q, Gerwert K. Infrared
spectral marker bands characterizing a
transient water wire inside a hydrophobic
membrane protein. J. Chem. Phys., 2014, 141,
22D524


+ Wolf S, Freier E, Gerwert K. A Delocalized
Proton-Binding Site within a Membrane
Protein. Biophysical Journal, 2014, 107,
174-184


+ Krauß SD, Petersen D, Niedieker D, Fricke
I, Freier E, El-Mashtoly SF, Gerwert K,
Mosig A. Colocalization of fluorescence
and Raman microscopic images for the
identification of subcellular compartments:
a validation study. Analyst, 2014, 140(7),
pp. 2360-2368


Cooperation 


+ QM/MM simulations on proteins. Frauke
Gräter, PICB, Shanghai; Qiang Cui, University
of Wisconsin, Madison, Wisconsin, USA.


+ GPCR structure and function. Stefan
Grünewald, PICB Shanghai; Zhao Qiang and
Wu Beili, SIMM, Shanghai.


+ High-performance Computing cluster. 
Prof. Eckhard Hofmann, Department 
of Biophysics, Ruhr University Bochum, 
Germany. 


+ Evolutionary Patterns of non-coding 
RNA. Prof. Peter F. Stadler, Interdisciplinary 
Center for Bioinformatics, University of 
Leipzig, Germany; Prof. Ivo Hofacker, 
Institute for Theoretical Chemistry and 
Structural Biology, University of Vienna, 
Austria 


+ Cell Tracking for Live Cell Imaging Data.
Prof. Kannappan Palaniappan, Department
of Computer Science, University of
Missouri-Columbia, USA; Dr. Jiu-Lin Du,
Lab of Sensory Integration and Behavior,
Institute for Neuroscience, Shanghai
Institutes for Biological Sciences, Shanghai,
China.


+ Signal Processing Methods for Analyzing 
In Vivo Flow Cytometer Data. Prof. Xunbin 
Wei, Institute for Biomedical Sciences, 
Fudan University, Shanghai, China, and Prof. 
Michael Clausen, Institute for Computer 
Science, University of Bonn, Germany. 


Teaching (2015-2017)


+ Regular teaching obligation by Klaus Gerwert
at Bochum University.

+ Regular teaching obligation by Axel Mosig at
Ruhr University Bochum (lectures, seminars
and practicals on Bioinformatics).


+ Teaching by Steffen Wolf at PICB. 


External Funding 


Active:


Validation of candidates for protein
biomarkers using the Protein Research Unit
Ruhr Within Europe (PURE)
State of North Rhine-Westphalia, Germany
05/15/15 - 12/31/17


Optogenetics SPP 1926
German Research Foundation (DFG), Germany
07/01/16 - 06/30/19


Heterotrimeric G proteins
German Research Foundation (DFG), Germany
01/01/17 - 12/31/19


Biomarker-based follow-up of patients
suffering from non-muscle-invasive low/
intermediate-risk bladder cancer (UroFollow)
German Social Accident Insurance (DGUV), Germany
01/01/16 - 12/31/18


Verification of new molecular markers for the
early diagnosis of asbestos associated lung
tumors (Lung II)
German Social Accident Insurance (DGUV), Germany
01/01/17 - 12/31/18


Completed:


GTP- and ATP-dependent Membrane
Processes, Collaborative Research Center SFB 642
German Research Foundation (DFG), Germany
07/01/08 - 06/30/16 (in three phases, each
4 years long).


Protein Research Unit Ruhr Within Europe (PURE)
State of North Rhine-Westphalia, Germany
01/01/10 - 12/31/14


Development of methods in protein analysis
for the identification of candidate markers
to facilitate the (early) diagnosis of asbestos-
associated tumors of the lung and pleura
(Lung I)
German Social Accident Insurance (DGUV), Germany
03/01/13 - 02/28/15


Center for Vibrational Microscopy (CVM)
Research Center Jülich, Germany
12/01/09 - 12/31/12


Fellowship of the Mercator Group
Mercator Stiftung, Germany
01/01/11 - 12/31/11


MD simulations on GPCRs for improvement of
GPCR protein crystallization
National Natural Science Foundation of
China, Grant No. Y23DC31611


SHENC: Research Unit "Shear flow regulation
of hemostasis - Bridging the gap between
nanomechanics and clinical presentation",
TP C3: Structural investigations of VWF under
flow by FTIR spectroscopy
German Research Foundation (DFG), Germany
01/01/12 - 12/31/14


Invited Plenary Talks by 
Klaus Gerwert since 2015 


+ 12/2016 CTAD 2016, San Diego, USA
+ 10/2016 International Bladder Cancer Network, Bochum, Germany
+ 10/2016 International Conference on Retinal Proteins (ICRP 2016), Potsdam, Germany
+ 09/2016 SciX 2016, Minneapolis, USA
+ 07/2016 K4DD Meeting, Basel, Switzerland
+ 07/2016 "Pathologie im Ziegelbau", Bamberg, Germany
+ 06/2016 SPEC 2016, Montreal, Canada
+ 04/2016 SPIE Photonics Europe Conference, Brussels, Belgium
+ 03/2016 Faraday Discussion on Advanced Vibrational Spectroscopy for Biomedical Applications, Cambridge, UK
+ 01/2016 Workshop "Modern methods for optical analysis in medicine", Freiburg, Germany
+ 01/2016 Chemical Colloquium of Friedrich-Schiller University, Jena, Germany
+ 01/2016 51st Winterseminar on Biophysical Chemistry, Molecular Biology and Cybernetics of cell functions, Klosters, Switzerland
+ 12/2015 Workshop at Roche, Basel, Switzerland
+ 11/2015 Colloquium "Pioneers in Cell Dynamics and Imaging": Label-free FTIR and Raman imaging and applications in precision medicine, Münster, Germany
+ 07/2015 Colloquium at University of Krakow, Krakow, Poland
+ 07/2015 RAS Structural workshop at the Frederick National Lab for Cancer Research (FNLCR), Frederick, USA
+ 07/2015 10th European Biophysics Congress (EBSA 2015), Dresden, Germany
+ 07/2015 8th International Conference on Advanced Vibrational Spectroscopy (ICAVS 2015), Vienna, Austria
+ 07/2015 Witec Workshop at ICAVS 2015, Vienna, Austria
+ 07/2015 Agilent Workshop at ICAVS 2015, Vienna, Austria
+ 07/2015 ISAS-Colloquium, Berlin, Germany
+ 06/2015 XVIIth International Conference on Time-Resolved Vibrational Spectroscopy (TRVS 2015), Madison, USA
+ 05/2015 ICOB2015 Congress, Florence, Italy
+ 05/2015 DKFZ Heidelberg, Heidelberg, Germany


4.2 Phylogenetic Combinatorics


Researchers:

Dr. Stefan Grünewald
Group Leader
Phone: +86-21-5492 0459
Email: Stefan@picb.ac.cn

Dr. Jialiang Yang (until March 2014)


Staff:

Dr. Jialiang Yang (Senior Staff Scientist, 2008 - 2014)
Dr. Zhisong He (Postdoctoral Fellow, since 2016)


Students:

Pei Wu (2009 - 2016)
Libo Huang (since 2011)
Weikang Gong (since 2014)
Mengzhen Guo (since 2016)
Faming Chen (since 2017)


Research
Overview


The main research area of the group is phylogenetics. We develop and implement methods to reconstruct phylogenetic trees or networks, with an emphasis on displaying ambiguity in the data and reticulate evolution. In addition, we work on combinatorial problems that arise in the context of phylogenetics. The structures that we would like to construct from biological data can often be described in combinatorial terms. For example, an unrooted phylogenetic network can be considered as a collection of bipartitions (splits) of the taxa set, and a rooted phylogenetic network is an acyclic directed graph. Compared with the combinatorics of trees, there are many open problems for more general networks. The solutions of such combinatorial problems are theorems, and they are published in mathematical journals. Many of the results have algorithmic consequences and give rise to new methods or insights into the performance or limitations of existing methods.


There are many more problems in biology and beyond where phylogenetic methods or similar combinatorial techniques can be applied. In addition to "proper" phylogenetic analysis, we have worked on the alignment of biological networks such as protein-protein interaction networks or gene regulatory networks. Further, we have used phylogenetic trees and networks to investigate the evolutionary history of an important protein family, as well as to analyze ancient Chinese characters.


Current State of Research 


Phylogenetic Network 
Reconstruction 


The most widely used methods to construct phylogenetic networks are NeighborNet and Split Decomposition. They both start with pairwise distances between the taxa of interest and then compute a collection of weighted splits (bipartitions). The number of splits is bounded by the size of the input, that is, the number of unordered pairs of taxa.


Pairwise distances have been used for a long time to reconstruct phylogenetic trees and networks. They are easy to compute from all kinds of raw data, and distance-based methods, especially Neighbour Joining, have been found to be surprisingly robust. However, the pendant edges in phylogenetic networks tend to be longer than the interior ones, thus errors in estimating their length can have a high impact. Since only the non-trivial splits with at least two taxa on each side are relevant for classification, we prefer weighted quartets to distances. They quantify the separation between two pairs of taxa and can be seen as the smallest building block for non-trivial splits. Further, there is less information reduction when the raw data is transformed into quartet weights, which allows us to consistently reconstruct more general classes of split systems than via distances.


Figure 1. A phylogenetic network on 25 Squamata species and six outgroups.


We published QNet and Quartet-Net, quartet analogues of NeighborNet and Split Decomposition, in 2007 and 2013, respectively. While QNet constructs circular split systems, the same class as NeighborNet, Quartet-Net can reconstruct a strictly larger class of split systems than Split Decomposition. However, both methods tend to produce networks quite similar to those of their distance-based analogues. We therefore developed one more method which can consistently reconstruct all split systems that do not contain all ten possible (3,3)-splits for any six taxa. The size of such split systems can be of the same order of magnitude as the number of all quartets (4th power of the number of taxa), thus our method is the first one that takes full advantage of the input size and does not follow the approach of any distance-based method.
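
As a rough illustration of the input sizes discussed above, the following sketch counts the unordered pairs and the 4-sets of taxa, and checks that six taxa admit exactly ten (3,3)-splits. It is not part of the group's software; the function names are chosen here for illustration, and n = 31 is taken from the 25 species plus six outgroups of Figure 1.

```python
from math import comb


def num_pairs(n: int) -> int:
    """Unordered pairs of taxa: the input size of a distance-based method."""
    return comb(n, 2)


def num_quartet_sets(n: int) -> int:
    """4-element subsets of taxa; each carries three possible quartets,
    so the number of quartets grows like the 4th power of n."""
    return comb(n, 4)


def num_balanced_splits(n: int) -> int:
    """Splits of an even-sized taxa set into two halves of equal size."""
    if n % 2:
        raise ValueError("n must be even")
    # Choosing one half determines the other, so every split is counted twice.
    return comb(n, n // 2) // 2


if __name__ == "__main__":
    n = 31  # 25 species plus six outgroups, as in Figure 1
    print(num_pairs(n), num_quartet_sets(n), num_balanced_splits(6))
```

The gap between the two counts is what a quartet-based method can exploit: for the same taxa set, there are far more quartet weights than pairwise distances.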


The quality of a phylogenetic network constructed with a quartet-based method heavily depends on how the quartet weights are computed from the raw data. This raw data is often a multiple sequence alignment, and the existing methods either use a simple pattern counting approach or use a Markov model that assumes that one of the three quartet trees is correct for every four taxa. We have developed a new method using the Hadamard conjugation. This classical concept can reconstruct arbitrary split systems under some models, for example Kimura's three substitution types (K3ST) model. It has not been used much in practice because it is slow for large taxa sets and it is not consistent under complex models. However, for computing quartet weights the computation is feasible and the restriction to simple models avoids over-fitting. Our new method to construct phylogenetic networks, together with the Hadamard-based quartet weights, has produced encouraging results. For example, we get an unusually tree-like and resolved network for a data set consisting of mitochondrial genomes of some Squamata species and outgroups.


Combinatorial Results 


The quartet distance is a way to compare 
two different binary phylogenetic trees on 
the same taxa set. It counts the number of 
4-sets for which the trees induce different 
quartets. The mean value and the variance 

of the distance have been known for a long 
time, and the measure is frequently used for 
various applications. However, one question 
about the diameter has been open for more 
than 30 years. In 1986, Bandelt and Dress 
observed that the maximum possible distance 
for n taxa, divided by the number of all sets of 
four taxa, is monotonically decreasing. They 
conjectured that the limit of this ratio is 2/3, 
which corresponds to the distance between 
two random trees. We now have a complete
proof of this so-called 2/3-Conjecture. We expect
this proof to not only settle a long-standing
mathematical problem but also to have some
impact within and beyond phylogenetics. On
the one hand, we suggest a weighted version
of the quartet distance that seems to be
useful for tree comparison, and on the other
hand, a special case of our theorem gives us
a new way to measure the dissimilarity of two
permutations. This might be useful for rank
statistics.
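
The definition of the quartet distance can be made concrete with a small brute-force sketch. It assumes, purely for illustration, that each binary tree is given by the set of its non-trivial splits, with each split represented by one of its two sides; the function names and the two five-taxon example trees are hypothetical. Since it enumerates all 4-sets, it is only suitable for small taxa sets.

```python
from itertools import combinations


def induced_quartet(splits, four):
    """Return the quartet (a pair of pairs) that a tree induces on four taxa.

    splits: iterable of frozensets, one side of each non-trivial split.
    Returns None if no split separates the four taxa two-against-two,
    which cannot happen for a binary tree.
    """
    for side in splits:
        inside = frozenset(t for t in four if t in side)
        if len(inside) == 2:
            return frozenset({inside, frozenset(four) - inside})
    return None


def quartet_distance(splits1, splits2, taxa):
    """Count the 4-sets of taxa on which the two trees induce different quartets."""
    return sum(
        induced_quartet(splits1, four) != induced_quartet(splits2, four)
        for four in combinations(sorted(taxa), 4)
    )


if __name__ == "__main__":
    taxa = {"a", "b", "c", "d", "e"}
    tree1 = [frozenset("ab"), frozenset("de")]  # ((a,b),c,(d,e))
    tree2 = [frozenset("ac"), frozenset("de")]  # ((a,c),b,(d,e))
    print(quartet_distance(tree1, tree2, taxa))  # prints 2
```

Dividing the result by the number of all 4-sets gives the normalized ratio to which the 2/3-Conjecture refers.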


As a byproduct of our development 

of methods to construct phylogenetic 
networks, we study split and set systems 

that correspond to an almost hierarchical 
clustering. One example is the class of 
2-weakly compatible split systems that we 
introduced when we developed Quartet- 
Net. It is defined by a forbidden configuration 
on six taxa and contains precisely all split 
systems that Quartet-Net can reconstruct 
consistently. In contrast to weakly compatible 
split systems, it is not easy to calculate the 
maximum cardinality of a 2-weakly compatible 
split system with n taxa. When we published 


the method, we only had a quadratic lower 
bound and a cubic upper bound. Using a 
new graph theoretical method, we have now 
improved the upper bound to O(n2°). This 
means that the split systems that Quartet-Net
can reconstruct are not as big as we had
hoped, but it also means that the worst-case
running time is faster than expected. The
newly developed proof techniques might be 
useful to better understand other classes of 
set and split systems. 


The last combinatorial result to be mentioned here deals with missing information. It is a classical result that a phylogenetic tree can be reconstructed from the pairwise distances, and it is easy to find examples where not all of the distances are needed. This observation naturally gives rise to the question whether the distance information on a given subset of the pairs of taxa suffices to reconstruct a phylogenetic tree. This question has been studied systematically by several collaborators of our group. They had posed the open problem whether a binary phylogenetic tree can always be reconstructed if, for every interior vertex v, there are three taxa with median v such that the pairwise distances between those three taxa are known. We recently found a surprisingly easy proof of this conjecture, and its publication is under review.


Collaborative Projects outside 
Phylogenetics 




In collaboration with Steffen Wolf, a former postdoc in the Department of Biophysics, we used phylogenetic networks to investigate the evolutionary history of G protein-coupled receptors (GPCRs). They form the largest family of membrane receptors in the human genome. We used their amino acid sequences, as well as the structures of 24 receptors that had been determined by X-ray crystallography, to gain new insights into how GPCRs should be classified and to propose a scenario of how various classes of GPCRs and their functions have evolved.


Figure 2. An evolutionary scenario for the evolution of some classes of GPCRs. 


Phylogenetic networks have been used frequently to analyze language evolution. For a different application in linguistics, we joined former PICB members Andreas Dress and Zhenbing Zeng to analyze ancient Chinese characters. We constructed a phylogenetic network on oracle-bone characters related to animals, which are more than 3000 years old. The network indicates how the ancient Chinese might have perceived the 'animal world' in the late Bronze Age.


We also joined a project analyzing the transcriptome of neocortical layers in humans, chimpanzees, and macaques. It was led by Philipp Khaitovich from the Comparative Biology Group and resulted in a publication in Nature Neuroscience.


Future Perspective 


Currently, we have a long queue of
unpublished results, including several
mentioned in the previous section. The main
short-term goal is to conclude these projects, 
which will significantly improve our recent 
publication record. We will then adjust our 
main research topics. 


Phylogenetic Networks 


We have no plans to develop more quartet-based methods to reconstruct phylogenetic networks. Instead, we are going to focus on detecting situations where our methods have an advantage over the more commonly used algorithms. We plan to systematically look for patterns where competing methods make errors that can be avoided by using quartets. Then we hope to find real data sets where those patterns occur. In addition, we will re-analyse some data sets where the tree reconstruction methods have produced contradicting outputs. For example, two recent publications claimed to have resolved the phylogenetic tree of birds, yet their trees are incompatible. Other data sets show conflicting signals between mitochondrial and nuclear genes. Finally, we are interested in computing quartet weights from raw data other than sequence alignments. This could be protein structures or any other data that has been used to calculate evolutionary distances.


Combinatorial Problems 


Our proof of the 2/3-Conjecture motivates us 
to study several related questions. A natural 
generalization of the quartet distance is to 
compare more than two trees. Since the 
quartet distance can be considered as a 
measure for the dependence between two 
branching processes, a higher-dimensional 
version would make it possible to pick up 
multi-way dependencies, even when every 
pair of trees is almost independent. Another
related open question for comparing two
trees on the same taxa set is the biggest
common subtree problem. It is still unknown
how fast the minimum size of such a subtree
grows with the number of taxa. We hope that
the new methodology that we developed to
find our result on the quartet distance will also
be helpful for this question.


We will continue to investigate classes of 
almost compatible split systems. We found 
our recent result on 2-weakly compatible split 
systems by associating a particular graph with 
every split system. This new tool might also be 
helpful to better understand other classes of 
split systems whose maximum cardinality has 
been unknown for a long time. 


Cell Differentiation 


Phylogenetic methods have been used to analyze different cell types in healthy tissue as well as in tumors. Since cell differentiation is a process in which a cell becomes more specialized, there is a rooted phylogenetic tree where the cell types are the vertices. The root represents pluripotent stem cells and the leaves are the fully differentiated cell types. The increasing quality of data that can be used to reconstruct such trees, as well as the high relevance, especially of intra-tumor phylogeny, makes this an exciting application of phylogenetics. We hope that our non-mainstream methods will be useful here. At our institute, the Computational Systems Genomics group has been developing a method that can use transcriptomic data, together with protein interaction networks, to estimate the differentiation potential of cell samples. Their measure of network entropy is related to cell-type phylogeny, and we are currently setting up a collaboration in this area.


General Information 


Publications (2014-2017) 


+ Yang J, Grünewald S, Xu Y, Wan XF.
Quartet-based methods to reconstruct
phylogenetic networks. BMC Systems
Biology. 2014. 8(1), 21.


+ Wolf S, Grünewald S. Sequence, structure
and ligand binding evolution of rhodopsin-
like G protein-coupled receptors: a crystal
structure-based phylogenetic analysis. PLoS
One. 2015. 10(4), e0123533.


+ Dress A, Grünewald S, Zeng Z. A cognitive
network for oracle-bone characters
related to animals. In: Phua K K, Ge M (Eds).
Peregrinations from Physics to Phylogeny:
Essays on the Occasion of Hao Bailin's 80th
Birthday. World Scientific.


+ He Z, Han D, Efimova O, Guijarro P,
Yu Q, Oleksiak A, Jiang S, Anokhin K,
Velichkovsky B, Grünewald S, Khaitovich P.
Comprehensive transcriptome analysis of
neocortical layers in humans, chimpanzees
and macaques. Nature Neuroscience.
2017. 20(6), 886-895.


Cooperation 


+ Phylogenetics, Prof. Mike Steel and Prof. 
Charles Semple, University of Canterbury 
Christchurch, New Zealand. 


+ Reconstructing phylogenetic trees from 
partial distances, Prof. Vincent Moulton and 
Prof. Katharina Huber, University of East 
Anglia, Norwich, UK. 


+ Almost compatible split and set systems, Prof. 
Jacobus Koolen, University of Science and 
Technology of China, Hefei, China. 


External Funding 


+ Biological network alignment, National 
Science Foundation of China, Grant No. 
10971213. 


Teaching (2014-2017) 


+ Contributed annually to the course Data 
Structures and Algorithms 


+ Taught a course Combinatorial Methods 
in Computational Biology at Shanghai 
University in 2016. 


Invited Talks (2014-2017) 


+ Invited Speaker, The Quartet Distance
between Phylogenetic Trees and an
Application Beyond Phylogenetics, Waiheke
2014: The 18th Annual New Zealand
Phylogenomics Meeting, Waiheke, New
Zealand, Feb 2 - Feb 7, 2014.


+ Invited Speaker, Phylogenetic Lassos and
the Triple Conjecture, Portobello 2015: The
19th Annual New Zealand Phylogenomics
Meeting, Portobello, New Zealand, Feb 1 -
Feb 6, 2015.


+ Invited Speaker, Phylogenetic Lassos and
the Triple Cover Conjecture, 2015 Workshop
on Combinatorics and Applications at SJTU,
Shanghai, China, April 21 - April 27, 2015.


4.3 Computational Genomics


Researchers:

Dr. Yungang He
Junior Principal Investigator
Phone: +86-21-54920469
Fax: +86-21-54920451
Email: yunganghe@picb.ac.cn

Prof. Dr. Li Jin
Adjunct Principal Investigator
Member of the Chinese Academy of Sciences
Phone: +86-21-54920455
Fax: +86-21-54920451
Email: lijin.picb@gmail.com


Students:

Xin Huang


Research
Overview


The Computational Genomics group was founded by Prof. Li Jin, who joined PICB as one of its founders in 2005. Prof. Jin is a faculty member at Fudan University and currently works as an Adjunct Principal Investigator at PICB in a transitional arrangement. Dr. Yungang He now leads the research and administrative activities of the group.


The group is interested in a broad range of topics in population genetics and computational biology. Its research currently focuses on two areas: 1) revealing evolutionary mechanisms for genetic variants in human populations; 2) applying computational approaches to genetic issues in human health.


1) Revealing evolutionary
mechanisms for genetic variants in
human populations


Current State of Research 


Project 1.1 A probabilistic method: bridging
the gap between research on natural
selection and association studies for complex
traits


Darwinian theory states that species arise and develop through natural selection of inherited variations that increase the individual's adaptation to its environment. It is important to understand differences in natural selection between human populations. Although computer simulation of genetic drift and selection can be helpful for comparing selection between populations, it is unlikely that the "real" population genetic history can be confidently represented in the simulations. When a large number of possible evolutionary scenarios must be simulated exhaustively to eliminate false-positive results, the time and intellectual effort required become unaffordable. It is therefore critical to develop a powerful and reliable statistical method for the genome-scale analysis of selection differences.


We reported the development of a probabilistic method for testing and estimating selection differences between populations (He et al. 2015). Using a probabilistic model of genetic drift and selection, we showed that log odds ratios of allele frequencies provide estimates of the differences in selection coefficients between populations. The estimates are approximately normally distributed, and their variance can be estimated using genome-wide variants. This allows us to quantify differences in selection coefficients and to determine the confidence intervals of the estimates. This method was applied to a genome-wide data analysis of Han and Tibetan populations. The results confirmed that both the EPAS1 and EGLN1 genes are under significantly different selection in Han and Tibetan populations.
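
The quantities described above can be sketched in a few lines. This is an illustrative sketch only, not the published implementation of He et al. (2015): the function names, the toy allele frequencies, and the choice to centre and scale by the empirical spread of genome-wide log odds ratios are assumptions made here. Any conversion of the log odds ratio into a per-generation selection difference (for example a scaling by divergence time) is deliberately left out, because it is not specified in this text.

```python
import math
from statistics import mean, stdev


def log_odds_ratio(p1: float, p2: float) -> float:
    """Log odds ratio of an allele's frequency in population 1 versus population 2."""
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    return math.log(p1 / (1.0 - p1)) - math.log(p2 / (1.0 - p2))


def selection_difference(p1, p2, genome_wide, z_crit=1.96):
    """Return (estimate, z_score, confidence_interval) for one locus.

    genome_wide: iterable of (freq_pop1, freq_pop2) pairs for background
    variants; their log odds ratios supply the empirical mean and spread
    (an assumption of this sketch).
    """
    background = [log_odds_ratio(a, b) for a, b in genome_wide]
    centre, spread = mean(background), stdev(background)
    estimate = log_odds_ratio(p1, p2)
    z_score = (estimate - centre) / spread
    interval = (estimate - z_crit * spread, estimate + z_crit * spread)
    return estimate, z_score, interval


if __name__ == "__main__":
    # Toy background frequencies, invented for illustration only.
    background = [(0.50, 0.48), (0.30, 0.33), (0.71, 0.69), (0.12, 0.10), (0.55, 0.58)]
    est, z, ci = selection_difference(0.85, 0.10, background)
    print(f"log OR = {est:.3f}, z = {z:.2f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the estimate is approximately normal, the same z-score machinery used for odds ratios in association studies carries over directly, which is the connection drawn in the following paragraph.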


Our study revealed a close connection between selection detection and classical genetic association. We measured the differences in allele frequencies between populations using their log odds ratios, while genetic association studies usually present the effect size of risk alleles as odds ratios with estimated confidence intervals. This connection is a great advantage when determining the technical details of a research design, especially sample sizes. For example, the statistical power of our approach can be easily calculated for a specified study design. More importantly, our study bridges a gap for researchers between natural selection and genetic association of complex traits.


Figure 1. Population specific differences in selection pressures on human pigmentation. We obtained the overall difference by summarizing selection differences on 31 loci for population pairs. The difference is defined as $\Delta s_{ij} = s_i - s_j$, where $s_i$ and $s_j$ are the selection coefficients of populations i and j. The abbreviation of population i is given in the row and that of population j in the column. The populations are abbreviated as follows: WAF, West Africans; EAF, East Africans; EUR, Europeans; SIB, Siberians; EAS, East Asians.


Project 1.2 Dissecting historical changes of 
selective pressures in the evolution of human 
pigmentation 


Human pigmentation is a highly diverse trait among populations, and has drawn particular attention from both academic investigators and the public for thousands of years. Although recent studies have detected signals of natural selection in multiple pigmentation genes, none of them has investigated the historical changes of selective pressures during different epochs or quantitatively compared the differences in selection between populations.


In the present study, we developed a new approach to dissect historical changes of the selective pressure in a genetic model with multiple populations (Huang et al., under review). Our theoretical analysis showed that natural selection on a complex trait can be measured by summarizing selection effects on all known genetic loci. We collected genotype data of 31 critical loci for human pigmentation from 15 public datasets, and obtained data for 3399 individuals from 5 representative populations worldwide. Our new approach revealed not only a recent incremental change of selective pressure (0.68 x 10°/generation) in modern Europeans, but also a significant historical increase of selective pressure (1.78 x 10°°/generation) on light pigmentation shared by all Eurasians during the out-of-Africa event. We favored diversifying selection as the single explanation for the cause of light pigmentation in Eurasians. Furthermore, our evolutionary analysis of separate loci suggests that epistasis plays an important role in the diversifying selection on human pigmentation among populations.


This is the first study in which historical changes of selective pressure are presented separately for different epochs in a multiple-population model. Such work had been considered difficult because genetic information from a large number of ancient genomes is hard to obtain. Since populations worldwide have never stopped adapting to frequent changes of their local environments, our study is important as a paradigm for future research on the continuous evolution of complex traits.


Figure 2. Modeling historical selective coefficients for the evolution of five representative human populations. Here, $s_i$ ($i = 0, 1, \ldots, 8$) denotes the selection coefficient of the i-th epoch. $\delta_i$ ($i = 1, 2, \ldots, 8$) denotes the selection (coefficient) change of the i-th epoch, and can be obtained by estimating selection (coefficient) differences between paired populations. The numbers on the branches indicate different evolutionary epochs. We assumed that Africans and Eurasians diverged ~3600 generations before present (BP), Europeans and Asians ~3000 generations BP, Siberians and East Asians ~2000 generations BP, and East and West Africans ~2000 generations BP.




Future Perspective 


When anatomically modern humans emerged from Africa and subsequently colonized the world, they encountered many challenges, including essential environmental alterations, food resource shifts, and infectious diseases. The current huge size and wide distribution of the modern human population demonstrate the evolutionary success of human beings, which intrigues geneticists and attracts them to investigate the natural selection and genetic adaptation of human populations. In recent years, genetic alterations under recent selection have attracted more attention than ever before. Consequently, some highly irregular genetic variants were discovered and further explored using various approaches. However, most of the published studies focus on genetic variants with relatively high frequencies in one or more populations. The dynamics and selection of rare variants are not yet well understood. Recent studies revealed that rare variants contribute substantially to human complex diseases and physical traits, such as obesity and body size and shape. It is therefore critical to explore the population genetics of rare variants in human populations. Advanced sequencing technology gives us an opportunity to look into rare variants at an affordable cost. More powerful methods and more sophisticated genetic models can be applied to the accumulating high-quality sequences from a broad range of populations. We expect to disclose more details about the population genetics of human rare variants in the next few years.


2) Applying computational 
approaches on genetic issues for 
human health 


Current State of Research 


Project 2.1 Collaborative research on the
evolution of drug resistance of hepatitis B
virus in a clinical trial


Hepatitis B virus (HBV) is an important infectious agent that has infected billions of people around the world. Our team was invited to work on the evolution of drug resistance in a phase IV, 2-year, multicentre, randomized, controlled trial (EFFORT study, NCT00962533).


Reverse transcriptase (RT) mutations contribute to hepatitis B virus resistance during antiviral therapy with nucleos(t)ide analogues. However, the composition of the RT quasispecies and their interactions during antiviral treatment are yet to be thoroughly defined. In our study, ten patients from each of 3 different virological response groups (i.e. complete virological response, partial virological response and virological breakthrough) were selected from a multicentre trial of Telbivudine treatment. Variations in the drug-resistance related critical RT regions of 107 serial serum samples from the 30 patients were examined by ultra-deep sequencing.


Sequencing results revealed different
dynamics of nonsynonymous mutations, such
as SLOP, sN40S, sG44E, sW172*, sW182*, and
sS187F, between patients with a complete
virologic response and those with a partial
virologic response. The viral population
heterogeneity decreased at week 12 of LdT
treatment in patients with a complete virologic
response, with a concomitant decline in
nonsynonymous mutations (from an average
of 14 to 9.9 per sample) and an increase in the
frequencies of major variants (from 14.3% to
40.4%). Our findings suggest that the decrease
in viral population heterogeneity at an early
stage of LdT treatment was associated with
the subsequent optimal virologic response
(Dong et al., 2016).
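
A drop in population heterogeneity together with a rise in the major variant frequency can be illustrated with a simple diversity measure. The sketch below uses Shannon entropy, which is one common choice and not necessarily the measure used in the study; the function names and the read counts are hypothetical and are not study data.

```python
import math
from collections import Counter


def shannon_entropy(variant_counts):
    """Shannon entropy (in nats) of a variant frequency distribution."""
    total = sum(variant_counts.values())
    entropy = 0.0
    for count in variant_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return entropy


def major_variant_frequency(variant_counts):
    """Frequency of the most abundant variant in the sample."""
    total = sum(variant_counts.values())
    return max(variant_counts.values()) / total


# Hypothetical read counts per variant at two time points (illustration only).
baseline = Counter({"v1": 14, "v2": 30, "v3": 28, "v4": 28})
week12 = Counter({"v1": 40, "v2": 25, "v3": 20, "v4": 15})

for label, sample in (("baseline", baseline), ("week 12", week12)):
    print(label, round(shannon_entropy(sample), 3),
          round(major_variant_frequency(sample), 3))
```

With these made-up counts, the week-12 sample has a lower entropy and a higher major variant frequency than the baseline sample, which is the qualitative pattern described above.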


Phylogenies of the quasispecies revealed
independent origins of two critical
quasispecies, i.e. the rtA181T and rtM204I
mutants. Data analyses and theoretical
modeling showed a cooperative-competitive
interplay among quasispecies. In particular,
rtM204I mutants compete against other
quasispecies and eventually lead to virological
breakthrough. However, in the absence of
rtM204I mutants, synergistic growth of the
drug-resistant rtA181T mutants with the
wild-type quasispecies could drive the composition
of the viral population into a state of partial
virological response (Zhou et al., 2015).


Figure 3. Phylogeny of HBVs from a patient with typical virological breakthrough. The lineages carrying
rtA181T and rtM204I mutations are marked in blue and red, respectively. The time points at which lineages were observed
are shown on different rims of the circles in different colors.


Project 2.2 Discovering roles of the gene 
network topology in the functional robustness 
of genetic alternatives 


Biological molecules and their physical or
logical relationships can be represented as
biological networks, in which nodes denote the
molecules and edges denote the relationships.
In evolutionary history, the topology of
a biological network underwent frequent
updates to adapt to varying environments.
Two separate but collaborative processes
can account for the topological evolution of
gene-related networks. First, the number of nodes
in a biological network can increase or decrease due to
the appearance or disappearance of genes;
second, edges between nodes can be added
or removed, representing the functional
differentiation of genes.
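
The two processes can be made concrete with a small undirected graph. This is only a minimal sketch: the class name `GeneNetwork`, its methods and the gene labels are hypothetical and do not correspond to any software used in the project.

```python
class GeneNetwork:
    """Undirected gene network stored as an adjacency dictionary."""

    def __init__(self):
        self.adj = {}

    def add_gene(self, gene):
        # Node gain: appearance of a gene (e.g. by duplication).
        self.adj.setdefault(gene, set())

    def remove_gene(self, gene):
        # Node loss: disappearance of a gene and of all its edges.
        for neighbour in self.adj.pop(gene, set()):
            if neighbour in self.adj:
                self.adj[neighbour].discard(gene)

    def add_edge(self, a, b):
        # Edge gain: two genes acquire a functional relationship.
        self.add_gene(a)
        self.add_gene(b)
        self.adj[a].add(b)
        self.adj[b].add(a)

    def remove_edge(self, a, b):
        # Edge loss: functional differentiation removes a relationship.
        self.adj.get(a, set()).discard(b)
        self.adj.get(b, set()).discard(a)

    def degree(self, gene):
        return len(self.adj.get(gene, set()))


network = GeneNetwork()
network.add_edge("geneA", "geneB")
network.add_edge("geneA", "geneC")
network.add_edge("geneA_copy", "geneB")   # duplicate inherits one edge
network.remove_gene("geneA_copy")          # later loss of the extra copy
print(network.degree("geneA"), network.degree("geneB"))
```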


We investigate the functional robustness of
genetic alternatives by studying the loss of
duplicated genes in yeast. Whole genome
duplication (WGD) with the subsequent
large-scale network rewiring is the most
prominent cause of the evolution of biological
networks. WGDs have occurred multiple
times in evolutionary history, where over
90% of eukaryotic genes are products of
serial gene duplication. It is therefore critical
to understand the genetic mechanisms of a
biological network through the appearance
and disappearance of new genes, especially
how the ancestral topology of the biological
network affects the disappearance of duplicated
gene copies after WGD.


By studying the loss of duplicated gene
copies in yeast with multi-source data, we
demonstrate that both robustness of fitness
to gene dosage and purifying Darwinian
selection play critical roles in determining
the fate of duplicated copies, and hence
contribute to the evolution of the topology
of the yeast metabolic network. Our work
reveals a connection between the fitness of
genetic alternatives and the topological features
of the biological network, and suggests a
topologically dependent mechanism of
network evolution (He et al., under review).
This discovery will support our efforts to
evaluate malfunctional variants using a
novel computational approach.


Figure 4. Our hypothesis about the fitness damage caused by WGD and the subsequent recovery due to
the loss of extra copies. The boxes indicate genes on the same or different chromosomes. We assume that the
gene duplication happens at one time point and that the gene losses occur at two later time points. The duplication of the green gene
does little damage to an individual's fitness. The extra copies of the yellow and red genes significantly damage fitness
and therefore lead to the loss of the extra copies during continued evolution.
Future Perspective


Few people realize that population genetics is
critical for health sciences, even though the
public has learned that molecular biology and
medical genetics are very important. In fact,
we can contribute substantially to current
research on health issues. In past years,
we have contributed several works that
promoted the progress of health sciences in
China. For example, our study of linkage
disequilibrium in East Asians facilitated
hundreds of genome-wide
association studies in Chinese populations.
We also worked with clinicians to better
understand the development of HBV drug
resistance in a clinical trial. There is no doubt
that we can do more when personal
genome sequencing becomes much more
popular in the near future. One of the keys to
personalized medicine is to understand how
unusual a novel rare genetic variant is in both
general populations and patient groups. This
key issue is naturally related to several of
our ongoing and finished projects.
We therefore expect to make significant
contributions to these issues through our established
global collaborations.


General Information 


Publications (*co-first authors,
†co-corresponding authors)


+ Dong H*, Zhou B*, Kang H*, Jin W*, Zhu Y,
Shen Y, Sun J, Wang S, Zhao G, Hou J, He
Y†. Small surface antigen variants of HBV
associated with responses to telbivudine
treatment in chronic hepatitis B patients.
Antivir Ther. 2016. [Epub ahead of print]


+ Xu H*, Yang Y*, Wang S, Zhu R, Qiu T, Qiu
J, Zhang Q, Jin L, He Y†, Tang K†, Cao Z.
Predicting the Mutating Distribution at
Antigenic Sites of the Influenza Virus. Sci
Rep. 2016;6:20239.


+ Zhou B*, Dong H*, He Y*, Sun J*, Jin W*, Xie
Q, Fan R, Wang M, Li R, Chen Y, Xie S, Shen
Y, Huang X, Wang S, Lu F, Jia J, Zhuang
H, Locarnini S, Zhao GP†, Jin L†, Hou J†.
Composition and Interactions of Hepatitis B
Virus Quasispecies Defined the Virological
Response During Telbivudine Therapy. Sci
Rep. 2015;5:17123.


+ He Y†, Wang M, Huang X, Li R, Xu H, Xu
S, Jin L†. A probabilistic method for testing
and estimating selection differences
between populations. Genome Res.
2015;25:1903-9.


+ Hu Y, Ding Q, He Y, Xu S, Jin L.
Reintroduction of a Homocysteine Level-
Associated Allele into East Asians by
Neanderthal Introgression. Mol Biol Evol.
2015;32:3108-13.


+ Xu H, Wang CC, Shrestha R, Wang LX,
Zhang M, He Y, Kidd JR, Kidd KK, Jin L,
Li H. Inferring population structure and
demographic history using Y-STR data
from worldwide populations. Mol Genet
Genomics. 2015;290:141-50.


Other Publications


+ Huang X, Wang S, Jin L†, He Y†. Dissecting
historical changes of selective pressures
in the evolution of human pigmentation.
(Under review)


+ He Y†, Huang X, Wang M, Li R, Jin L*.
Robustness to dosage enables the dependent
topologic evolution of network of yeast
metabolic genes. (Under review)


Cooperation


+ Evolution of human pathogens. Prof.
Guoping Zhao, Institute of Plant Physiology
and Ecology, SIBS, CAS, Shanghai, China;
Prof. Jinlin Hou, Department of Infectious
Diseases, Nanfang Hospital, Southern
Medical University, Guangzhou, China.


+ Genetic basis for birth defects. Prof.
Hongyan Wang, Obstetrics & Gynecology
Hospital, Fudan University, Shanghai, China.


+ Missing data in statistical analysis.
Assistant Professor of Biostatistics, Ye Shen,
College of Public Health Sciences Campus,
University of Georgia, Athens GA, US.


+ Mathematics for phylogenetics. Lecturer
in Computing Sciences, Taoyang Wu,
School of Computing Sciences, University
of East Anglia, Norwich, UK.


+ Roles of rare variants in human
complex traits. Prof. Sijia Wang, IRG
Dermatogenomics group, PICB, China.


+ Natural selection in human populations.
Prof. Shuhua Xu, IRG Population Genomics
group, PICB, China.


External Funding


+ Active:


Outstanding Member Program
Youth Innovation Promotion Association
(Membership ID No. 2012216),
Chinese Academy of Sciences, China
Principal Investigator: Yungang He
01/01/2017 - 12/31/2019


+ Completed:


"Developing Joint Estimation for Fitness
of Genetic Variant in A Multi-population
Model: Method and Its Application"
National Natural Science Foundation of
China (Grant No. 91331109)
Principal Investigator: Yungang He
01/01/2014 - 12/31/2016


"Developing Investigation Methods for
Genetic Structure of Human Populations
Utilizing Genome Sequences"
National Natural Science Foundation of
China (Grant No. 31171279)
Principal Investigator: Yungang He
01/01/2012 - 12/31/2015


Regular Program
Youth Innovation Promotion Association
(Membership ID 2012216),
Chinese Academy of Sciences, China
Principal Investigator: Yungang He
01/01/2012 - 12/31/2015


Teaching 


+ Human Evolutionary Genetics (Li Jin and 
Shuhua Xu); lecture for graduate students 
of SIBS. 
4.4 Evolutionary Genomics


Researchers:
Dr. Haipeng Li (Principal Investigator)
Phone: +86-21-54920460
Email: lihaipeng@picb.ac.cn

Staff:
Dr. Jinggong Xiangyu (Associate Professor)
Zhili Gu (Research Associate)
Guangyi Dai (Research Associate)

Students:
Chen Ming
Yuting Wang
Zongfeng Yang
Wanile Mu
Ziqian Hao
Pengyuan Du
Kao Lin (graduated in 2013)
Junrui Li (graduated in 2013)
Feng Gao (graduated in 2016)


Research
Overview


The Evolutionary Genomics group was
established in September 2007 and is headed
by Dr. Haipeng Li. To pursue our research
interests, the group has maintained a small
and very efficient theoretical research team
(currently composed of 3 smart students and
the PI). Other group members work on related
data analysis, software development and
experimental analysis.


Since the last evaluation in 2014, we first
expanded our formal approach to estimating the
recombination rate, which is very important in
many fields. We considered the key features of
next-generation sequencing technologies
and developed the FastEPRR software to analyze
genome-wide SNP data sets. It is more
than 300,000 times faster than LDhat, the
well-known software developed by Gil McVean
and his colleagues at the University of Oxford,
while the two estimates have the same
level of accuracy. Second, using genetic data
and mathematical modelling, we have found
that vertebrates all over the planet began to
experience rapid population declines starting
in the late 19th century, coinciding with the
widespread industrialization and profound
change of global living ecosystems over the
past two centuries.


I. Fast estimation of recombination
rate by machine learning


Current State of Research 


Genetic recombination is a very important 
evolutionary mechanism that mixes parental 
haplotypes and produces new raw material for 
organismal evolution. In living organisms, this 
process is highly regulated and, because its 
rate varies along the genome, much attention 
has been paid to identifying recombination 
hotspots. Increased knowledge about 
recombination will be useful for studies of 
linkage disequilibrium (LD), admixture, natural 
selection, and associated work on genetic 
diseases. Thus, information on recombination 
rates is critical for biological research. 


Recombination rates can be estimated
by experimentally counting the number
of such events during meiosis. However,
the application of this approach is limited
because of the extremely low frequency of
recombination. This issue can be overcome
on the one hand by sequencing a large
number of parent-offspring pairs or a
large amount of sperm from a single male.
On the other, the number of recombination
events that occurred in the past can be
inferred via coalescent theory and population
genetics; in this approach, the population
recombination rate is denoted as $\rho$ ($= 4N_e r$), where
$N_e$ is the effective population size and $r$ the
recombination rate per generation. Over
the last two decades, a number of methods
that use likelihood models to estimate
recombination rates from intraspecific DNA
polymorphism data have been proposed.
Of these, full-likelihood methods, including
importance sampling, Markov Chain Monte
Carlo (MCMC), and Bayesian MCMC have
proved the most accurate for estimating
$\rho$. However, because full-likelihood
approaches are very computationally
expensive, even with moderately-sized
data sets, a composite-likelihood method
based on two-locus sampling probabilities
was also proposed to estimate $\rho$, and an
improved approach is implemented in the
LDhat software package. Although these
composite-likelihood methods are
computationally simpler than full-likelihood
approaches, calculations are still very
time-consuming.
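
As a worked illustration of the definition $\rho = 4N_e r$, the short sketch below converts a per-base-pair rate into a per-window value. The function name and all numerical values (effective size, rate, window length) are hypothetical and chosen only to show the arithmetic.

```python
def population_recombination_rate(effective_size, rate_per_generation):
    """Population recombination rate rho = 4 * N_e * r."""
    return 4.0 * effective_size * rate_per_generation


# Hypothetical inputs, for illustration only.
effective_size = 10_000          # N_e, diploid individuals
rate_per_bp = 1e-8               # r, per base pair per generation
window_length_bp = 50_000        # a 50-kb window

rho_per_bp = population_recombination_rate(effective_size, rate_per_bp)
rho_per_window = rho_per_bp * window_length_bp
print(rho_per_bp, rho_per_window)
```

Under these assumed values, $\rho$ is about 4e-4 per base pair, or about 20 for the whole 50-kb window.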
Figure 1. Recombination rates of chromosome 7 for three human populations of African (YRI), European (CEU) and
East Asian (CHB) ancestry at a 50-kb scale. The cartoon at the bottom is a visualization of the chromosome.
In this project, we introduced a new extremely
fast open-source software package (FastEPRR)
that uses machine learning to estimate the
population recombination rate $\rho$ ($= 4N_e r$) from
intraspecific DNA polymorphism data, where
$N_e$ is the effective population size and $r$ the
recombination rate. When $\rho$ and the number of
sampled diploid individuals are large enough,
the variance of $\rho_{\mathrm{FastEPRR}}$ remains slightly smaller than
that of $\rho_{\mathrm{LDhat}}$. This makes FastEPRR suitable for mapping
recombination hotspots in different species.
In particular, the new estimate (calculated by
averaging $\rho_{\mathrm{FastEPRR}}$ and $\rho_{\mathrm{LDhat}}$) has the smallest variance in all
cases. When estimating $\rho_{\mathrm{FastEPRR}}$, the finite-site model
was employed to analyze cases with a high
rate of recurrent mutations, and an additional
method is proposed to consider the effect of
variable recombination rates within windows.
Simulations encompassing a wide range
of parameters demonstrate that different
evolutionary factors, such as demography and
selection, may not increase the false positive
rate of recombination hotspots.


Overall, the accuracy of FastEPRR is similar to that of the
well-known method, LDhat, but FastEPRR requires
far less computation time. Genetic maps for
each human population (YRI, CEU and CHB)
extracted from the 1000 Genomes OMNI data
set were obtained (Figure 1) in less than three
days using just a single CPU core. The Pearson
pairwise correlation coefficient between the
$\rho_{\mathrm{FastEPRR}}$ and $\rho_{\mathrm{LDhat}}$ maps is very high, ranging between
0.929 and 0.987 at a 5-Mb scale. Considering
that sample sizes for these kinds of data are
increasing dramatically with advances in
next-generation sequencing technologies, FastEPRR
is expected to become a widely used tool
for establishing genetic maps and studying
recombination hotspots in the population
genomic era.
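
The comparison of two genetic maps at a coarser scale can be sketched as follows: fine-scale window estimates are averaged into larger bins and the Pearson correlation is then computed between the two binned maps. This is a minimal illustration, not the procedure used to produce the reported coefficients; the helper names and the toy values are hypothetical.

```python
import math


def rebin(values, factor):
    """Average consecutive groups of `factor` windows (e.g. 100 x 50 kb -> 5 Mb)."""
    binned = []
    for start in range(0, len(values), factor):
        chunk = values[start:start + factor]
        binned.append(sum(chunk) / len(chunk))
    return binned


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy per-window estimates from two hypothetical methods (not real data).
map_a = [1.0, 1.2, 0.8, 3.0, 2.8, 3.2, 0.5, 0.4, 0.6]
map_b = [1.1, 0.9, 1.0, 2.7, 3.1, 3.0, 0.7, 0.3, 0.5]

coarse_a = rebin(map_a, 3)
coarse_b = rebin(map_b, 3)
print(pearson(map_a, map_b), pearson(coarse_a, coarse_b))
```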


Future Perspective 


Inspired by our pioneering work in 2011 and our
subsequent works in 2013 and 2016, a number
of well-known groups at UC San Diego, UC
Berkeley and Rutgers University have also
applied machine learning to (theoretical)
population genetic analysis (Pybus, et al. 2015,
Bioinformatics; Ronen, et al. 2013, Genetics;
Schrider and Kern 2016, PLoS Genetics;
Schrider, et al. 2015, Genetics; Sheehan and
Song 2016, PLoS Comput Biol). We will therefore
continue to lead this field by applying
machine learning to different questions
in theoretical population genetics.


II. Recent demography inference
and conservation genetics


Current State of Research 


The current rate of species extinction is ~1,000
times the background rate of extinction, and is
attributable to human impact, ecological and
demographic fluctuations, and inbreeding due
to small population sizes. While preservation
of biodiversity is vital to a sustainable human
society, rapid population decline (RPD)
continues to be widespread across taxa. When
RPD occurs, it is accompanied by a loss of
genetic diversity. Genetic diversity is reflected
in the genetic differences among individuals
and is essential for populations to adapt to
changing environments. The start date and
the rate of RPD provide useful information
for effective conservation of threatened
species and are important for promotion
of public awareness of the threat. However,
these two key parameters are difficult to
estimate because there is virtually no time
series data on population size over hundreds
of years. Therefore, an alternative approach is
to estimate the start date and the rate of RPD
using mathematical modeling.


Changes in population size over thousands
of years can be inferred for a species
from genome-wide DNA polymorphism
data. However, it remains a formidable
technical challenge to infer an RPD event
because the signal of such an event is
weak on the typical time scale of observable
polymorphism. To overcome the limited
resolution of the genetic data from a
single species, we propose a new approach
which draws conclusions based on the
collective support from many species. The
central premise of our approach is that
the threat of extinction faced by thousands of species
was primarily due to a common cause in
the past that led to a significant depletion of
available habitats and resources. Consequently,
we were able to draw conclusions based on
present-day polymorphism data from a large
number of threatened species and their
non-threatened relatives.


We reviewed more than 10,000 peer-reviewed
papers published in the last two and a half
decades, among which ~2,500 papers in
164 scientific journals were found to have
surveyed the genetic diversity of at least one
vertebrate species. In total, we analyzed the
genetic diversity data of 2,764 vertebrate
species. Then we used the International Union
for Conservation of Nature (IUCN) Red List
categories to determine the level of extinction
risk for each species. Our population genetics
modeling suggests that in many threatened
vertebrate species the RPD on average began
in the late 19th century (Figure 2), and the
mean current size of threatened vertebrates is
only 5% of their ancestral size. We estimated
a ~25% population decline every 10 years in
threatened vertebrate species.
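
The two reported figures, a ~25% decline per decade and a current size of about 5% of the ancestral size, can be related by simple arithmetic. The sketch below assumes a constant proportional decline per decade, which is a simplification made here for illustration and not the population genetics model used in the study; the function name is hypothetical.

```python
import math


def decades_to_reach(fraction_remaining, decline_per_decade):
    """Decades needed for a population to shrink to `fraction_remaining`
    of its starting size under a constant proportional decline."""
    return math.log(fraction_remaining) / math.log(1.0 - decline_per_decade)


decades = decades_to_reach(0.05, 0.25)
print(round(decades, 1), "decades, i.e. about", round(decades * 10), "years")
```

Under this simplified assumption the decline takes on the order of a century, which is broadly in line with an onset in the late 19th century.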


Here we studied RPD in vertebrates because
vertebrates have been more extensively
investigated in the past. However, our
conclusions should have some generality,
as vertebrate species live in a wide range of
ecosystems. Moreover, the proposed method
is also suitable for studying non-vertebrate
species.


Figure 2. Rapid population decline in the late 19th century. (Plot of population size against century, 18th to 21st,
for non-threatened and threatened species.)


Future Perspective 


Conservation genetics is an important field,
especially for developing countries,
including China. However, there are no
other theoretical population geneticists
working on this subject in China. In the
future, we are therefore going to work closely with
Chinese conservation geneticists to conserve
threatened species more efficiently.


General Information 


Publications (*co-first authors,
†co-corresponding authors)


+ Li H*, Xiang-Yu J*, Dai G, Gu Z, Ming C,
Yang Z, Ryder O. A, Li W-H†, Fu Y-X†, Zhang
Y-P†. Large numbers of vertebrates began
rapid population decline in the late 19th
century. PNAS, 2016, 113: 14079-14084.


+ Gao F, Ming C, Hu W, Li H†. New software
for the fast estimation of population
recombination rates (FastEPRR) in the
genomic era. G3-Genes Genomes Genetics,
2016, 6:1563-1571.
4.5 Computational Genomics & Big Data Group


Researchers:
Prof. Yixue Li
Director of Bio-Med Big Data Center
Phone: 86-21 54920089
E-mail: yxli@sibs.ac.cn

Postdoc:
Liyun Yuan

Students:
Shengdi Li
Gaohui Zhan
Lu Arou
Benpeng Miao
Qingyu Xiao
Tuantuan Gui
Yunhe Fu

Staff:
Xun Gong
Hong Li        Bal Yang          i Sun
ing Lin        angyoumin Feng    uhua Yang
Zhen Wang      Jia Li            Yigin Bai      Feng li
Guohui Ding    Sheng He          Shijie Tang
Weili Lin      Boqiang Zhang


Research
Overview


Our Computational Genomics & Big Data group
has been headed by Dr. Yixue Li since he
joined PICB in 2016. Dr. Li is also the director
of the Bio-Med Big Data Center of PICB. Our
group has broad interests in bioinformatics
and systems biology. The current research
is focused on: 1) Animal model genomics.
We use the methods of comparative
genomics, population genetics and functional
genomics to study and analyze the genetic
and regulatory mechanisms of specific animal
models. 2) Cancer computational biology.
We develop computational methods to
analyze and interpret cancer multi-omics
data, to further accelerate the understanding of
cancer biology, and to collaborate with doctors
in resolving clinical problems. 3) Biomedical
databases and algorithms, including
biomedical knowledge discovery, data mining
and database construction, and building big data
analysis pipelines, algorithms and tools for
specific scientific problems.

Animal model genomics 


Genetic origin of hypoxia
adaptation in Tibetan Mastiffs


Background


The Tibetan Mastiff (TM), a native of the
Tibetan Plateau, has quickly adapted to
the extreme highland environment. In our
previous research published in Genome
Research in 2014, the impact of positive
selection on the TM genome was studied
and potential hypoxia-adaptive genes were
identified, such as EPAS1 and HBB. However,
the origin of the adaptive variants remains
unknown.


Current State of Research 


In the current study, we investigated the
signature of genetic introgression in the
adaptation of TMs with dog and wolf
genomic data from different altitudes in close
geographic proximity. On a genome-wide
scale, the TM was much more closely related
to other dogs than to wolves. However, using
the 'ABBA/BABA' test, we identified genomic
regions from the TM that possibly introgressed
from the Tibetan gray wolf (Fig. 1). Several of
the regions, including the EPAS1 and HBB
loci, also showed the dominant signature of
selective sweeps in the TM genome (Fig. 2).
We validated the introgression of the two loci
by excluding the possibility of convergent
evolution and ancestral polymorphisms and
examined the haplotypes of all available
canid genomes (Fig. 2). The estimated time of
introgression based on a non-coding region
of the EPAS1 locus mostly overlapped with the
Paleolithic era. Our results demonstrated that
the introgression of hypoxia-adaptive genes in
wolves from the highland played an important
role for dogs living in hypoxic environments,
which indicated that domestic animals could
acquire local adaptation quickly by secondary
contact with their wild relatives.


Future perspective


Our research provides an intriguing example
of parallel evolution between dogs and
humans (Fig. 3). It turns out to be the same
location, same gene and same mechanism:
interbreeding, as adopted by Tibetan people
to adapt to the plateau. Recent evidence
shows that they may have also acquired their
high-altitude adaptation by interbreeding with
an ancient hominid known as the Denisovans.
The study adds to the significant evidence
generated by scientists of the profound
contributions and adaptations that can occur
as a result of ancient interbreeding. This study
was published in Molecular Biology and Evolution
in 2016 and reported by news in Science and
Christian Science Monitor.
Figure 1. The analysis of the introgression between Tibetan Mastiff and Tibetan gray wolf. (A) The pattern of ABBA/
BABA nucleotide sites in the analysis. GLJ is in state A as an outgroup and TW is in state B. TM and YJD, with the
recent common ancestor that diverged from the ancestral population of TW, are in different states of either A or B. (B)
Manhattan plot of the Patterson's D-statistic for the 10,688 segments of 200 kb across all autosomal chromosomes.
Each dot represents a segment and the segments from one chromosome have the same color. There were 236
segments with significantly positive Patterson's D values indicating the introgression between TM and TW (P-value
< 0.01, one-tailed Z-test after Bonferroni correction). (C) Distribution of Z-transformed D statistics for the 200-kb
segments. The red bars show the significant regions. (D) Venn plot showing the overlapping regions with positive
introgression and selective sweeps in TM or TW. The regions were all measured in windows of 200 kb.
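
The per-segment statistic in panels B and C can be sketched in a few lines. Patterson's D is computed from counts of ABBA and BABA sites as (ABBA - BABA) / (ABBA + BABA); the Z-transformation and the Bonferroni-corrected one-tailed test below are a simplified illustration under a normal approximation, not the exact pipeline behind the figure. The function names and the site counts are hypothetical.

```python
import math


def patterson_d(n_abba, n_baba):
    """Patterson's D for one segment from ABBA and BABA site counts."""
    total = n_abba + n_baba
    if total == 0:
        return 0.0
    return (n_abba - n_baba) / total


def z_scores(values):
    """Standardize values by their sample mean and standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def one_tailed_p(z):
    """Upper-tail p-value of a standard normal deviate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical (ABBA, BABA) counts for five segments (illustration only).
segments = [(12, 10), (30, 8), (9, 11), (15, 15), (7, 9)]
d_values = [patterson_d(a, b) for a, b in segments]
z_values = z_scores(d_values)
# Bonferroni correction: multiply each p-value by the number of segments.
p_adjusted = [min(1.0, one_tailed_p(z) * len(segments)) for z in z_values]

for d, z, p in zip(d_values, z_values, p_adjusted):
    print(round(d, 3), round(z, 3), round(p, 3))
```

A strongly positive D in a segment indicates an excess of ABBA sites, which is the pattern interpreted above as a signal of introgression between TM and TW.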
Figure 2. Evidence of introgression at the EPAS1 locus. (A) The distribution of fixed ABBA/BABA sites and Fst in
this locus. This locus has significantly excessive ABBA sites and significant Fst between TM and YJD. (B) A haplotype
network based on the top 50 common haplotypes in the locus. The haplotypes were defined from 29 representative
SNPs across all available canid genomes in DoGSD. Each circle represents a haplotype and the size is proportional
to the number of individuals belonging to that haplotype. The colors represent different populations. Lines connect
each haplotype to its most similar relative. Bars represent mutational steps between haplotypes. (C) The probability
of maintaining the length of the haplotype assuming the recombination rate of per base pair per generation. The
shorter the divergence time is, the larger the probability is. Even if the shortest time, 11,000 years, was considered, the
probability of maintaining the EPAS1 haplotype was significantly unlikely (P-value=0.007). (D) Sequence divergence is
reduced between TM and TW in the EPAS1 locus compared with the genomic background (P-value < 0.01, t-test).
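
The reasoning in panel C, that a long shared haplotype is less likely to survive recombination the longer two lineages have been separated, can be illustrated with one commonly used approximation. The sketch assumes that the length of a shared tract follows a gamma distribution with shape 2 and mean-scale 1/(r t), where r is the recombination rate per base pair per generation and t the divergence time in generations. This is an assumption made here for illustration and may differ from the calculation behind the figure; the function name and all parameter values (haplotype length, recombination rate, generation time) are hypothetical.

```python
import math


def prob_haplotype_at_least(length_bp, recomb_rate, years, generation_time):
    """P(shared tract >= length_bp) under a gamma(shape=2) tract-length model.

    The expected tract scale is 1 / (recomb_rate * generations), so the
    survival function is (1 + x) * exp(-x) with x = length * rate * generations.
    """
    generations = years / generation_time
    x = length_bp * recomb_rate * generations
    return (1.0 + x) * math.exp(-x)


# Hypothetical inputs, chosen only to show the trend with divergence time.
length_bp = 50_000
recomb_rate = 1e-8        # per base pair per generation (assumed value)
generation_time = 3.0     # years per generation (assumed value)

for years in (11_000, 20_000, 40_000):
    p = prob_haplotype_at_least(length_bp, recomb_rate, years, generation_time)
    print(years, round(p, 4))
```

With any fixed haplotype length and recombination rate, the probability decreases as the divergence time grows, which matches the statement that the shortest divergence time gives the largest probability.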
Figure 3. Ancient interbreeding with the Tibetan gray wolf
enabled the Tibetan Mastiff to adapt to high altitude. A similar
evolutionary mechanism occurred in parallel in the Tibetan
people.


Genetic mechanism for 
hyperglycemia tolerance of 
camels 


Background 


To adapt to the harsh conditions of deserts
or semi-deserts, camels have acquired many
special abilities and attributes. For example,
blood glucose levels in camels are twice those
of other ruminants. Previous physiological
experiments demonstrated that the high level
of blood glucose in camels may be caused by
their strong capacity for insulin resistance, yet
they do not develop diabetes or hypertension.
With the first whole genome sequence
of the Bactrian camel, which we published in Nature
Communications in 2012, we aim to clarify
the genetic mechanism for hyperglycemia
tolerance of camels.


Current State of Research 


Here we report that the copy number
variations of the CYP2 and CYP4 families generate a
unique metabolic module that creates a
well-maintained concentration balance between
two functionally divergent metabolites
(20-HETE and EETs), which provides camels with a
protective mechanism against diabetes and
other metabolic diseases (Fig. 4). The camel
genome we sequenced recently suggests
a high-copy-CYP2J and low-copy-CYP4A/F
module in camels compared to that in cattle
and humans. Based on the prediction of the
module, our quantitative experiments using an
LC/MS/MS assay to measure 20-HETE
and EETs in the plasma of camels, cattle and
humans demonstrate that camels can maintain
significantly higher EETs concentrations and
lower 20-HETE/EETs ratios than
cattle and humans (Fig. 4). The higher EETs
concentrations and lower 20-HETE/EETs ratios
in camels can help them to tolerate hyperglycemia.


Figure 4. (A) Metabolic and functional balance between EETs and 20-HETE. There are more copies of CYP2J in
camels than in cattle and humans, and fewer copies of CYP4A/F in camels than in cattle and humans. (B,
C) Comparison of plasma EETs concentration (ng/ml) and 20-HETE/EETs ratio in camels (n = 6), cattle (n = 19) and humans (n = 30) measured
by LC/MS/MS assay. The distributions are shown on a log scale. The line within the box defines the median; the
ends of the boxes define the 25th and 75th percentiles; the error bars define the 10th and 90th percentiles.


We have shown that camels reconstructed
their genome during their long-term evolution
to adopt a CYP2-CYP4 metabolic module
that provides an endogenous protective
mechanism enabling them to avoid
metabolic disorders involving high blood
glucose levels. These findings illustrate that
camels should be a unique model for the
study of metabolic regulation and metabolic
diseases. Future work will explore the
expression level and enzyme activity of CYP
genes in camel tissues.


Genome and transcriptome of
laboratory rabbits


Background


The European rabbit is an important
experimental animal model for biomedical
science. Rabbits are not only the most-used
animal for the production of antibodies,
but are also widely used for studying
a variety of human diseases. Unlike mice
and rats, rabbits have unique features of
lipid metabolism that have made them an
important model for human cardiovascular
diseases. Over the past century, the
rabbit model has provided tremendous
breakthroughs in understanding the
molecular mechanisms of hyperlipidemia and
atherosclerosis, including the discoveries of
low density lipoprotein receptor deficiency as a
cause of human familial hypercholesterolemia
and of statin, the most potent lipid-lowering
drug. Despite this, genetic information and
RNA expression profiling of laboratory rabbits
are lacking.


Current State of Research 


Our study characterized the whole-genome
variants of three breeds of the most popular
experimental rabbits, New Zealand White
(NZW), Japanese White (JW) and Watanabe
heritable hyperlipidemic (WHHL) rabbits (Fig.
5). Although the genetic diversity of WHHL
rabbits was relatively low, they accumulated a
large proportion of high-frequency deleterious
mutations due to the small population
size. Some of the deleterious mutations
were associated with the pathophysiology
of WHHL rabbits in addition to the LDLR
deficiency. Furthermore, this study conducted
transcriptome sequencing of different
organs of both WHHL and cholesterol-rich
diet (Chol)-fed NZW rabbits. It was found
that gene expression profiles of the two
rabbit models were essentially similar in the
aorta, even though they exhibited different
types of hypercholesterolemia (Fig. 6). In contrast,
Chol-fed rabbits, but not WHHL rabbits, exhibited
pronounced inflammatory responses and
abnormal lipid metabolism in the liver (Fig. 7).


Future perspective 


To our knowledge, this is the first genomic 
and systematic study of experimental rabbits. 


The completion of the rabbit genome and 
transcriptome information will facilitate future 
studies using rabbit models to investigate 
human diseases and also drive the generation 
of transgenic and knockout rabbits for 
biomedical research and drug development.


Figure 5. Whole-genome sequencing of laboratory rabbits. (a) Lipid profiles of standard chow-fed NZW, Chol-fed NZW
and WHHL rabbits analyzed by high performance liquid chromatography. The Chol-fed NZW showed elevated β-VLDLs,
and the WHHL rabbits showed increased LDLs and reduced HDLs. (b) Cumulative distribution of depth of coverage for
whole-genome sequencing. The average depth of coverage was 13x for each individual rabbit. (c) Phylogenetic tree of the
rabbits. The tree was constructed on the basis of representative SNPs with the maximum likelihood method. Bootstrap
values are marked on the branches. (d) Distribution of nucleotide diversity π. The statistics were calculated for every 100 kb
sliding window across the genome.

Figure 6. Transcriptome profiling of rabbit models with aortic atherosclerosis. (a) Heatmap of DEGs between Chol-fed
and normal chow-fed NZW rabbits as well as between WHHL and JW rabbits. The DEGs should have FDR < 0.1 in at least
one comparison. Log2-fold changes of DEGs are illustrated by gradient colors. The transcriptional changes of Chol-fed
and WHHL rabbits were similar in the aorta but distinct in the liver. (b) Strong positive correlation of expression changes
in the aorta between Chol-fed and WHHL rabbits. The correlation coefficient was calculated for DEGs in at least one
condition. (c) Macrographs of aortas in normal chow-fed, Chol-fed and WHHL rabbits. The aortic lesions are stained red
with Sudan IV. Both Chol-fed and WHHL rabbits showed extensive atherosclerotic lesions. (d) Heatmap of representative
DEGs responsible for inflammation responses in the aorta. The read counts were log-transformed and normalized across
samples. These genes induced inflammatory responses in both Chol-fed and WHHL rabbits compared with the normal
controls.


Cancer computational biology


Background 


New technologies and big data promote the development of cancer
research, and analyzing and interpreting cancer big data is becoming
more and more important. Therefore, we aim to use computational
biology methods to accelerate the transformation of data into
information, and finally into knowledge.


Current State of Research 


For basic research, we utilize omics data to study cancer subtypes,
heterogeneity and evolution. Taking MSH2-mutated tumors as an example,
we comprehensively investigated the genomic landscape and proposed a
mutation progress model (Fig. 8). In inherited Lynch syndrome, an
individual inherits a pathologic MSH2 mutation from a parent; this
mutation damages the DNA mismatch repair (MMR) system; somatic
mutation or methylation may then serve as a “second hit” at the
wild-type allele or other MMR genes; finally, the cell accumulates a
large number of somatic mutations, resulting in carcinogenesis. In
contrast, MSH2 mutations in sporadic cancers occurred in only one
somatic tissue, and more likely occurred after the initial driver
mutations.


For translational research, we compare the omics data of different
subgroups to find candidate signatures, or build computational models
to predict clinical features. Taking glioma as an example, we used
differential coexpression and differential regulation analysis and
revealed a novel three-transcription-factor signature. This signature
clusters glioma patients into three major subtypes that differ
significantly in patient survival as well as in transcriptomic
patterns (Fig. 9).


We also collaborate closely with doctors to 
understand problems in clinical practice, 
such as drug sensitivity of liver cancer and 
circulating tumor DNA of lung cancer. We 
hope that our research can provide clues
for cancer diagnosis, metastasis, recurrence, 
survival, and drug treatment. 


Future perspective 


Although we have done considerable work in cancer computational
biology, we recognize that several problems remain. First, how can we
obtain enough samples or omics data? Collecting high-quality cancer
samples is not an easy task, and high-throughput experiments are
expensive. A single study usually cannot produce enough omics data. To
resolve this problem, we will build cancer databases to collect and
accumulate public datasets. Second, how can we develop new methods or
analysis strategies? Due to the complexity of biology and the
heterogeneity of cancer patients, direct application of traditional
methods might not work well. We are committed to better understanding
cancer characteristics and developing more sophisticated algorithms.
Finally, how can we make preliminary results valuable to the clinic?
High-throughput experiments and limited sample sizes often produce
false-positive results; therefore, validation in larger samples is
necessary. Additionally, biomarkers that are found in cancer tissue
samples need to be validated in blood, and biomarkers that are found
in cell lines or animal models need to be validated in patients. This
work cannot be finished in the laboratory alone; we need to work with
doctors and carry out win-win cooperation.


Figure 8. MSH2 mutations and effects in inherited and non-inherited cancers. A) Proposed model for the mutagenic
progress of multiple primary cancers in Lynch syndrome. B) Proposed model for the mutagenic progress of non-inherited
cancers.

Figure 9. The clustering heat maps and survival analysis results with three-TF signature in five data sets. The numbers of
clusters (k) were determined by NMF based on the expression signatures of 3 TFs. Heat maps of three-TF DRA signature
in glioma samples are shown on the left. Kaplan-Meier survival curves of the overall survival for the patients from each
molecular subtype are shown on the right.


Bioinformatics algorithm 


cisASE: detecting allele-specific expression


Background 


Allele-specific gene expression (ASE) refers to the differential
expression of two alleles in a diploid genome, which is primarily a
result of the associated cis-element variation and allele-specific
epigenetic modifications. ASE studies provide an opportunity for the
development of new therapeutic strategies that activate beneficial
alleles or silence mutated alleles at specific loci. Next-generation
sequencing (NGS) allows for the genome-wide identification of ASE;
however, several problems exist. The first problem is technical and
intrinsic allele bias resulting from genome mapping and CNV. The
second problem is that gene-level ASE detection usually requires
phased SNVs or parental genomes, which are usually not available. The
third problem is that existing statistical tests for ASE did not take
sequencing errors into account.


Current State of Research 


To overcome the problems outlined above, we proposed a new
computational method and developed a software tool, cisASE, based on a
likelihood ratio test (Fig. 10). cisASE uses DNA-seq data to make
site-by-site adjustments for RNA allele imbalance assessment to reduce
the effects of technical bias and CNV. cisASE can report ASE at the
single nucleotide variant (SNV), exon and gene levels from sequencing
data without requiring phasing or parental information. It also
considers the quality of each base to reduce the influence of
sequencing error. We tested cisASE on both simulated and real datasets
(Fig. 11), and found that cisASE exhibits significantly improved
accuracy and computational speed compared with existing
state-of-the-art methods. In the absence of matched DNA-seq data,
cisASE performed moderately well. We applied cisASE to public colon
tumor datasets and found a higher cis-regulated ASE level in tumor
than in normal samples. We also observed several important features,
including germline ASE hotspots at human leukocyte antigen (HLA) loci
and cancer somatic ASE genes in focal adhesion and extracellular
matrix (ECM)-receptor interaction pathways. cisASE is freely available
at http://lifecenter.sgst.cn/cisASE.
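
The idea of a DNA-adjusted likelihood ratio test can be illustrated with a minimal Python sketch. This is not the cisASE implementation: it assumes a simple binomial model for allele counts at a single heterozygous SNV, ignores base quality, and uses hypothetical function names (`ase_llr`, `simulated_llr_cutoff`). The null hypothesis is that the RNA reference-allele fraction equals the DNA reference-allele fraction (0.5 when no DNA-seq data are given); the cutoff is taken from simulated null counts, in the spirit of the workflow summarized for Figure 10.

```python
import math
import random


def binom_loglik(ref, alt, p):
    """Binomial log-likelihood of the allele counts (constant term dropped)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite at p = 0 or 1
    return ref * math.log(p) + alt * math.log(1.0 - p)


def ase_llr(rna_ref, rna_alt, dna_ref=None, dna_alt=None):
    """Log-likelihood ratio: free RNA allele fraction vs. fraction fixed by DNA."""
    n = rna_ref + rna_alt
    if n == 0:
        return 0.0
    if dna_ref is None or dna_alt is None or dna_ref + dna_alt == 0:
        p0 = 0.5  # assumption: balanced alleles when no DNA-seq data exist
    else:
        p0 = dna_ref / (dna_ref + dna_alt)  # site-specific expectation from DNA
    p_hat = rna_ref / n  # maximum-likelihood estimate under the alternative
    return binom_loglik(rna_ref, rna_alt, p_hat) - binom_loglik(rna_ref, rna_alt, p0)


def simulated_llr_cutoff(depth, p0=0.5, alpha=0.05, n_sim=2000, seed=0):
    """Upper-alpha quantile of the LLR under simulated null allele counts."""
    rng = random.Random(seed)
    null_llrs = []
    for _ in range(n_sim):
        ref = sum(rng.random() < p0 for _ in range(depth))
        alt = depth - ref
        null_llrs.append(
            binom_loglik(ref, alt, ref / depth) - binom_loglik(ref, alt, p0)
        )
    null_llrs.sort()
    return null_llrs[min(n_sim - 1, int((1.0 - alpha) * n_sim))]


if __name__ == "__main__":
    cutoff = simulated_llr_cutoff(depth=100)
    # 80 vs 20 RNA reads at a site with balanced DNA coverage.
    print(ase_llr(80, 20, 30, 30) > cutoff)
    # 67 vs 33 RNA reads where DNA already shows a 2:1 ratio (e.g. a CNV).
    print(ase_llr(67, 33, 40, 20) > cutoff)
```

In this toy setting, the second site is not called as ASE because its RNA imbalance is explained by the DNA allele ratio, which is the kind of site-by-site adjustment described above.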


Future perspective 


cisASE integrates DNA-Seq and RNA-Seq data to identify cis-acting
regulatory variations, which provides a new opportunity to detect
driver mutations in diseases. We have illustrated the landscape of
germline and somatic ASE in colon tumors, and expect that the method
could have broad applications in other cancer types.


Figure 10. Overview of cisASE algorithm. Mapped RNA-seq and DNA-seq data are piled up using samtools. LLR of the
null and alternative models is calculated for each feature, i.e. SNV, exon or gene. Simulated DNA and RNA allele counts are
generated from the data to produce a null distribution of the LLR, which is used to define an LLR cutoff with respect to a
particular significance level. The LLR of each measured feature is compared to the LLR cutoff to determine whether it is an
ASE.

Figure 11. Distribution of the percentage of ASE over-expressing reference alleles in a real dataset. For each cancer
(N = 16) and normal sample (N = 16), we calculated the percentage of ASE specific to the reference allele (ASE with
over-expressed reference allele). Ideally, without technical and assessment bias, we would expect a balanced proportion of
ASE SNVs over-expressing reference allele and alternative allele (0.5 versus 0.5). The observed proportion of ASEs
over-expressing reference alleles was 0.50 when using cisASE and 0.67 when using MBASED, indicating that cisASE efficiently
decreased the false positive results caused by pre-existing bias, especially bias toward the reference allele.


LncPriCNet: prioritizing disease candidate lncRNAs


Background


LncRNAs play pivotal roles in many important biological processes.
Currently, the functions of most lncRNAs remain unknown, and the
knowledge of disease-related lncRNAs is very limited. Candidate
lncRNAs may be obtained by differential expression analysis or
genome-wide association study analysis, but there are still too many
candidates to validate experimentally. Therefore, it is urgent to
prioritize lncRNAs that are potentially associated with diseases.


Current State of Research


Assuming that functionally related lncRNAs and genes play roles in
phenotypically similar diseases, we proposed a computational method
called LncPriCNet (disease candidate LncRNAs Prioritization based on a
Composite Network) to prioritize disease-related lncRNAs.


First, we constructed a composite network integrating multi-level
information including phenotypes, lncRNAs, genes and their
associations (Fig. 12a). Then we extended the random walk with restart
(RWR) algorithm to a multi-level network to capture global information
(Fig. 12b). LncPriCNet achieves an overall performance superior to
that of previous methods, with high AUC values of up to 0.93. We
further used LncPriCNet to infer relationships between all lncRNAs and
53 disease phenotypes to chart a predicted lncRNA-disease landscape
(Fig. 13). The predicted landscape revealed the modularity of the
disease-lncRNA network and identified several lncRNA hotspots. An
R-based package of LncPriCNet is available at
https://cran.r-project.org/
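
The random walk with restart at the core of this approach can be sketched in a few lines of Python. The sketch below is the standard single-network RWR, not the multi-level extension used by LncPriCNet; the toy network, the restart probability of 0.7 and the function name `random_walk_with_restart` are illustrative assumptions. Scores are iterated as p(t+1) = (1 - r) W p(t) + r p(0), where W is the column-normalized adjacency matrix and p(0) places equal probability on the seed nodes.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probability of every node."""
    n = len(adjacency)
    # Column-normalize: column j holds the transition probabilities out of node j.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    w = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] > 0 else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Initial distribution: equal probability on each seed node.
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1.0 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return p


if __name__ == "__main__":
    # Hypothetical toy chain: phenotype(0) - gene(1) - lncRNA A(2) - lncRNA B(3)
    toy_network = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    scores = random_walk_with_restart(toy_network, seeds=[0])
    # Candidates are ranked by score; the lncRNA closer to the seed ranks higher.
    print(sorted([2, 3], key=lambda node: scores[node], reverse=True))
```

Edge weights can be used in place of the 0/1 entries; the column normalization then turns them into transition probabilities.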


Future perspective 


LncPriCNet is a useful tool for disease lncRNA prioritization and
provides a better understanding of the molecular mechanisms of human
disease at the lncRNA level, which may uncover new diagnostic and
therapeutic opportunities. The strategy of the multi-level composite
network could be used in other fields of biomedicine, such as disease,
drug and target discovery.


[Figure 12 graphic. (a) Legend: lncRNA, phenotype, gene. Sub-networks shown include lncRNA-lncRNA associations, gene-gene
interactions, phenotype-lncRNA associations, gene-lncRNA associations and phenotype-gene associations, which are combined
into the multi-level composite network. (b) Map the candidates and seeds into the composite network; extended random
walk; score and rank candidates on the composite network based on global similarity to seeds.]


Figure 12. The flow chart of LncPriCNet. (a) Construction of the multi-level composite network. This network is
constructed from six sub-networks. The thickness of the edge indicates the weight score. (b) The flow chart by which
LncPriCNet optimizes the candidate lncRNAs. First, the candidate lncRNAs of interest and seed nodes are mapped to the
multi-level composite network. Then, a global extended RWR method is used to score the candidate lncRNAs according to
their proximity to seed nodes. Finally, the candidate lncRNAs are ranked according to the scores.

Figure 13. Global view of the predicted landscape of human disease lncRNAs. (a) Hierarchical clustering of the
LncPriCNet scores between 53 phenotypes and 10082 lncRNAs. The color of each cell represents the LncPriCNet score of
an lncRNA (row) for a phenotype (column). Phenotype clusters were annotated with enriched disease categories (bottom),
and lncRNA clusters were annotated with the most enriched pathways of their co-expressed genes (right). The red circled
region indicates a module composed of lncRNAs involved in the cell cycle process. (b) Zoom-in plot of the red circled
region, involving 3 types of cancer and 156 high-risk lncRNAs. (c) Enriched pathways for the co-expressed genes of 156
high-risk lncRNAs.


Dianqing Wu, Yale School of Medicine, USA.


• Ethnic difference of prostate cancer based
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integration. Uniformed Services University 
of the Health Sciences-Center for Prostate
Disease Research (CPDR), and the US
National Cancer Institute (NCI), USA.


4.6 Regulatory and Systems Genomics


Researchers:
Dr. Zhen Shao (Principal Investigator)
Phone: +86-21-54920367
Email: shaozhen@picb.ac.cn


Staff scientists
Yuannyu Zhang (Assistant Professor)
Yuangao Wang (Assistant Professor)
Ge Shi (Research Associate)
Mushan Li (Research Associate)


Students
Shi-qi Tu
Jia-ying Yao
Yin Huang
Hong-duo Sun
Feng-xiang Tan
Min Shao


Research Overview


The Regulatory and Systems Genomics group was established in October
2013 as a new addition to the Partner Institute for Computational
Biology and is headed by Dr. Zhen Shao.


Our research interests fall into three main categories: (1) developing
computational tools and platforms for quantitative comparison and
integration of multi-omics data (genomic, epigenomic, transcriptomic,
proteomic, etc.); (2) collaborating with bench-work biologists to
study the regulation of tissue-specific gene expression; (3)
uncovering the underlying functional and structural hierarchy of
genomic elements, including enhancers and non-coding RNAs.


I. Epigenetic regulation of gene transcription


Current State of Research 


1. Non-canonical function of PRC2: 


Epigenetic machinery is crucial for tissue development and cellular
homeostasis, and its deregulation often drives the pathogenesis of
human disorders. Polycomb repressive complex 2 (PRC2) represents a
major class of epigenetic regulator that participates in
transcriptional repression by catalyzing histone H3 lysine 27
di/tri-methylation (H3K27me2/3). The canonical PRC2 complex consists
of EED, SUZ12, and the histone methyltransferase EZH2. A confounding
feature of the mammalian PRC2 complexes is the existence of two highly
related enzymatic subunits, EZH1 and EZH2, with near-identical
catalytic SET domains. Whereas the role of EZH2 in H3K27me3-mediated
transcriptional repression has been well established, the function of
EZH1-PRC2 remains elusive and controversial. For example, in embryonic
and skin stem cells, EZH1 complements EZH2 to maintain repressive
chromatin and stem cell identity. In contrast, Ezh1 predominantly
targets H3K4me3-marked active promoters and promotes RNA polymerase
(Pol) II elongation in differentiating muscle cells and hippocampal
neurons. Similarly, the role of PRC2 in hematopoiesis remains elusive,
due in part to the possible redundancy of EZH1/2 and to difficulties
in distinguishing effects related to canonical and non-canonical PRC2
functions that are mediated by EZH1 or EZH2 independent of their
methyltransferase activity.


Our collaborators previously established a two-phase ex vivo culture
system to model the differentiation of primary human CD34+
hematopoietic stem/progenitor cells (HSPCs) to erythroid progenitor
cells (ProEs). Using this system, we found that EZH1 is progressively
and significantly upregulated and EZH2 expression is modestly
downregulated during differentiation, whereas EED and SUZ12 mRNA and
protein levels remain largely unchanged. To identify gene targets for
each PRC2 core subunit, we performed RNA-seq transcriptomic analysis
of differentiated ProEs upon shRNA-mediated silencing of each subunit.
Surprisingly, depletion of EZH1 or SUZ12 resulted in significantly
more downregulated genes than depletion of EZH2 or EED (Figure 1A),
suggesting that these genes may be directly or indirectly dependent on
EZH1 or SUZ12 for optimal expression. To relate gene expression
changes directly to occupancy of the PRC2 complexes, we next
determined the chromatin targets of each PRC2 core subunit by ChIP-seq
analysis. We extracted the promoters occupied by at least one subunit
and performed k-means clustering analysis. Importantly, PRC2-targeted
genes can be separated into two distinct categories. One group is
predominantly occupied by SUZ12, EED, and EZH2 and is highly enriched
for H3K27me3; we named this category “Canonical PRC2 targets.” In
contrast, the “Non-Canonical PRC2 targets” are predominantly occupied
by SUZ12 and EZH1, and are enriched for H3K4me3 and DNase I
hypersensitive sites (DHS), in addition to many TFs known to activate
erythroid gene expression (Figure 1B). We then integrated the
chromatin binding data with the gene expression changes and correlated
combinations of PRC2 subunit binding with PRC2-mediated repression or
activation of target gene expression. Importantly, EZH2+SUZ12,
EED+SUZ12, EZH2+EED+SUZ12, or EED alone strongly correlate with
transcriptional repression. In contrast, EZH1+SUZ12 strongly
correlates with gene activation. EZH1 or SUZ12 alone also associate
with activation (Figure 1C). Therefore, EZH2 or EZH1, through
differential association with SUZ12, mediate distinct transcriptional
outputs: EZH2+SUZ12 is predominantly repressive, whereas EZH1+SUZ12 is
predominantly activating.
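
As a rough illustration of relating subunit combinations to expression changes, the Python sketch below groups promoters by the set of bound subunits and averages an expression-change value per group. It is only a schematic stand-in for the correlation analysis summarized in Figure 1C: the gene names, binding calls and log2 fold changes are invented for the example, and the function name `mean_change_by_combination` is hypothetical.

```python
from collections import defaultdict


def mean_change_by_combination(bound_subunits, log2_fold_change):
    """Average expression change for each combination of bound subunits."""
    groups = defaultdict(list)
    for gene, subunits in bound_subunits.items():
        if gene in log2_fold_change and subunits:
            key = "+".join(sorted(subunits))  # e.g. "EZH1+SUZ12"
            groups[key].append(log2_fold_change[gene])
    return {key: sum(values) / len(values) for key, values in groups.items()}


if __name__ == "__main__":
    # Hypothetical ChIP-seq calls: which subunits occupy each promoter.
    binding = {
        "geneA": {"EZH2", "SUZ12"},
        "geneB": {"SUZ12", "EZH2"},
        "geneC": {"EZH1", "SUZ12"},
        "geneD": {"EZH1", "SUZ12"},
    }
    # Hypothetical expression changes (positive = activated target).
    changes = {"geneA": -1.5, "geneB": -0.5, "geneC": 1.0, "geneD": 2.0}
    print(mean_change_by_combination(binding, changes))
```

A negative group mean would point to a predominantly repressive combination and a positive mean to an activating one, under whatever sign convention is chosen for the expression-change values.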


2. Functional hierarchy of super enhancers:
Enhancers are the primary determinants of cell identity, but the
regulatory components controlling enhancer turnover during lineage
commitment remain largely unknown. We compared the enhancer landscape,
transcription factor occupancy, and transcriptomic changes in human
fetal and adult hematopoietic stem/progenitor cells and committed
erythroid progenitors. We find that enhancers are modulated
pervasively and direct lineage- and stage-specific transcription. The
GATA2-to-GATA1 switch is prevalent at dynamic enhancers and drives
erythroid enhancer commissioning. Examination of lineage-specific
enhancers identifies transcription factors and their combinatorial
patterns in enhancer turnover. Importantly, by CRISPR/Cas9-mediated
genomic editing, we uncover a functional hierarchy of constituent
enhancers within the SLC25A37 super-enhancer. Despite
indistinguishable chromatin features, we reveal through genomic
editing the functional diversity of several GATA switch enhancers, in
which enhancers with opposing functions cooperate to coordinate
transcription. Thus, genome-wide enhancer profiling coupled with in
situ enhancer editing can provide critical insights into the
functional complexity of enhancers during development.


Figure 1. (A) Gene expression changes upon depletion of each
PRC2 subunit. The numbers of upregulated (fold change ≥ 2,
p value < 0.05; PRC2-repressed) and downregulated (PRC2-
activated) genes are shown for each knockdown. Right
panel: the expression changes of PRC2-activated and PRC2-
repressed genes during erythroid differentiation. (B) ChIP-seq
density heatmaps for H3K4me3, H3K27me3, and PRC2 subunits
within each promoter category (left). K-means clustering of
all PRC2-associated promoters identifies canonical and non-
canonical PRC2 targets (right). (C) Gene expression correlation
analysis of PRC2 subunit composition and transcriptional
activities. (D) Chromatin signatures and TF occupancy within
the human or mouse SLC25A37 locus in HSPC (A0) versus ProE
(A5) or undifferentiated G1E versus differentiated G1ER cells
are shown, respectively. The SLC25A37 constituent enhancers
(E1, E2, and E3) and the proximal promoter (P) are depicted by
shaded lines. The sequence conservation by PhastCons analysis
is shown. (Adapted from Xu et al, Molecular Cell 2015 and
Huang et al, Developmental Cell 2016)


Future Perspective 


Cis-regulatory elements, including enhancers, are known to function as
multifactorial platforms for binding of lineage-regulating TFs,
chromatin regulators, and signaling effectors. A long-standing
question is how these elements acquire the ability to translate intra-
and extracellular signals into cell-type-specific transcriptional
responses in development and disease. In addition, it is estimated
that there are 2,000-3,000 sequence-specific DNA-binding TFs encoded
by the human genome, with 200-300 TFs being expressed in each cell
type. However, the regulatory networks in which TFs and other
chromatin-associated factors collaborate to regulate transcription
dynamics remain poorly understood. In the future, more computational
tools and analyses are needed to further decode the underlying
principles of tissue-specific transcriptional regulation.


II. Post-transcriptional regulation of gene expression


Current State of Research 


1. Methodology development: In eukaryotic cells, most protein-coding
genes need to be expressed as proteins to carry out their functions.
In the past two decades, mass spectrometry (MS) has become one of the
most powerful tools to quantitatively measure the abundance of
proteins in biological samples, providing a routine way to analyze
protein expression levels and post-translational modification status.
Driven by the increasing need to comprehensively assess protein
expression changes under different biological contexts, numerous
MS-based techniques have been developed to support relative
quantification of protein abundance. Among these techniques, protein
quantification based on stable isotope labeling plays an important
role in proteomic studies due to its high accuracy and efficiency.
Most isotope-labeling-based methods introduce different stable isotope
labels into proteins or peptides of different biological samples to
create specific mass tags that can be distinguished by high-resolution
MS instruments. Using this technique, researchers can efficiently
quantify the relative expression levels of thousands of proteins
across multiple biological samples in a single MS run. An important
task is then to accurately identify proteins with significant
expression changes between these samples. A number of statistical
models have been developed for this task. However, these models
typically require either prior knowledge or separate experiments to
assess the noise level of isobaric experiments, and generating
technical replicates results in additional costs in both time and
resources. In addition, instrumental variations may not be negligible
among different experiments. Thus, the most accurate estimation of
technical errors should be generated directly from the proteomic
profiles under comparison.


We developed a new computational model (Figure 2A), termed Model-based
Analysis of Proteomic data (MAP), to statistically compare proteomic
profiling data generated from different biological samples and
directly identify proteins showing significant abundance changes
without involving information from technical replicates. As its key
feature, MAP considers all detected proteins as a mixture of
differentially expressed and non-differentially expressed ones, and
chooses only those with low intensity changes to model the technical
and systematic errors. Technically, MAP applies a step-by-step
regression analysis to the MS intensities of selected proteins to
globally estimate the impact of technical and systematic errors as a
function of the MS intensities. Then, the estimated error function is
used as a reference to calculate a P-value for each protein,
representing the significance of its intensity change. To validate the
effectiveness of this new approach, we utilized the DEEP SEQ MS
technique to perform extensive quantitative proteomic profiling in
both undifferentiated and differentiated mouse embryonic stem cells
(mESCs), and applied MAP to analyze the protein expression changes.
For comparison with MAP, two existing tools were also applied to this
dataset. By incorporating a set of previously published ribosome
profiling data of undifferentiated and differentiated mESCs, we found
that MAP showed clearly better accuracy in detecting the
differentially expressed proteins during mESC differentiation (Figure
2B-C), using the mRNA translation changes detected from ribosomal
profiling data as reference. In addition, a web platform of MAP is
provided to facilitate its use by the community
(http://bioinfo.sibs.ac.cn/shaolab/MAP).


Figure 2. (A) Workflow of MAP. MAP takes two protein abundance profiles generated by quantitative MS
experiments as input. After global normalization of MS intensities, a step-by-step regression analysis is applied to
model the contribution of technical and systematic errors in the observed intensity changes, and the obtained
model is then applied to every detected protein as a reference to infer the significance of its intensity change. (B)
The accuracy score of the top 500 differentially expressed proteins between undifferentiated and differentiated
mESCs detected by MAP and two existing methods from each of the three runs separately. Here the accuracy score
was defined as the fraction of proteins whose abundance change is consistent with the translation change of the
corresponding mRNA transcript detected from ribosomal profiling data in terms of direction. (C) The accuracy score
of the top 500, 1000 and 1500 differentially expressed proteins detected by three methods by combining all three
MS runs. Here the top differentially expressed proteins were selected based on the most significant P-value of each
protein among three runs. (Adapted from Li et al, under review).
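
The two ideas above, estimating the error level from the profiles themselves and scoring accuracy by direction of change, can be sketched in Python as follows. This is not the MAP algorithm: instead of MAP's step-by-step regression on low-change proteins, the sketch uses the median absolute deviation (MAD) of log2 intensity changes within intensity bins as a robust, intensity-dependent error estimate, followed by a two-sided normal test. The function names, the bin count and the example intensities are assumptions made for illustration. The `accuracy_score` function follows the definition given in the Figure 2 caption (fraction of proteins whose abundance change agrees in direction with the translation change).

```python
import math
import statistics


def intensity_dependent_pvalues(sample_a, sample_b, n_bins=4):
    """Two-sided P-value per protein for its log2 intensity change."""
    if not sample_a:
        return []
    change = [math.log2(b / a) for a, b in zip(sample_a, sample_b)]
    level = [0.5 * math.log2(a * b) for a, b in zip(sample_a, sample_b)]
    order = sorted(range(len(change)), key=lambda i: level[i])
    bin_size = math.ceil(len(order) / n_bins)
    pvalues = [1.0] * len(change)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        center = statistics.median(change[i] for i in members)
        mad = statistics.median(abs(change[i] - center) for i in members)
        sd = 1.4826 * mad  # MAD-to-sd factor for normally distributed noise
        if sd > 0:
            for i in members:
                z = (change[i] - center) / sd
                pvalues[i] = math.erfc(abs(z) / math.sqrt(2.0))
    return pvalues


def accuracy_score(protein_changes, translation_changes):
    """Fraction of proteins changing in the same direction as mRNA translation."""
    pairs = [
        (p, t) for p, t in zip(protein_changes, translation_changes)
        if p != 0 and t != 0
    ]
    if not pairs:
        return 0.0
    return sum((p > 0) == (t > 0) for p, t in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical MS intensities: seven stable proteins and one 4-fold change.
    a = [100.0] * 8
    b = [100.0, 101.0, 99.0, 102.0, 98.0, 100.5, 99.5, 400.0]
    pvals = intensity_dependent_pvalues(a, b, n_bins=1)
    print([i for i, p in enumerate(pvals) if p < 0.001])
    print(accuracy_score([1.0, -2.0, 3.0, -1.0], [0.5, -1.0, -2.0, -3.0]))
```

Because the MAD is insensitive to a minority of strongly changing proteins, the error estimate is dominated by proteins with low intensity changes, which is the intuition behind modeling the noise from the compared profiles rather than from technical replicates.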


2. Study of the protein expression change during erythroid cell
differentiation. Cellular differentiation requires highly coordinated
gene expression through transcriptional and post-transcriptional
regulation. With the advent of high-throughput sequencing, significant
advances have been made in understanding genome and epigenome
variation. These studies have relied on quantifying mRNA as the sole
measurement of gene expression, yet the extent to which changes in
mRNA are translated to protein at a genome scale remains unknown. We
reasoned that comparing the proteomic and transcriptomic profiles in
physiologically relevant developmental contexts might provide insights
into the post-transcriptional pathways in lineage specification of
tissue stem cells, which would otherwise have been overlooked using
only transcriptome-based analysis.


Figure 3. (A) The correlation between RNA and protein expression changes in adult-stage HSPCs (A0) and ProEs (A5) is
shown. (B) GO enrichment analysis of protein and RNA expression changes. The green box highlights the top enriched
GO terms for Protein-only genes. (C) Heatmap is shown for transcription factors that are significantly upregulated at the
protein but not mRNA levels. The genes are ranked based on the significance of fold changes in protein expression between
HSPCs (A0) and ProEs (A5). n.d. not detected. (D) Model of mTORC1-mediated post-transcriptional control of mitochondrial
biogenesis during erythropoiesis. (Adapted from Liu et al, Nature Cell Biology 2017).


Through an unbiased comparison of 
proteomic and transcriptomic changes 
between human hematopoietic stem/ 


progenitor cells (HSPCs) and differentiated 
erythroid progenitor cells (proEs), we found 
that 1,050 out of the 1,549 (67.8%) proteins 
upregulated in proEs did not show parallel 
changes in cognate mRNAs, which were 


then named as ‘Protein-only’ genes by us 
(Figure 3A). This finding suggests a large 
number of genes are regulated through 
post-transcriptional mechanisms during 
erythroid cell differentiation. By gene 
ontogeny (GO) analysis, the most enriched 
pathways in these ‘Protein-only’ genes 

are related to mitochondrial biogenesis, 
including ATP biosynthesis, electron transport 
chain, oxidative phosphorylation, and 
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cellular respiration (Figure 3B), suggesting 
mitochondrial biogenesis enhanced 

through post-transcriptional mechanisms. 
Mitochondrial biogenesis is controlled through 
coordinated regulation of signaling pathways 
and downstream TFs. We next surveyed 

1,314 annotated human TFs and identified 

28 TFs whose proteins but not mRNAs were 
significantly increased during erythropoiesis 
(Figure 3C). Notably, mitochondrial 
transcription factor A (TFAM), were ranked third 
based on the significance of fold increases in 
protein expression between HSPCs and ProEs 
(Figure 3C). TFAM is essential for transcription 
and replication of mitochondrial genome. 

We then validated that loss of TFAM leads to 
profound changes in intracellular metabolites, 
histone acetylation and gene expression, 

thus providing a functional link between 
mitochondrial metabolism and epigenetic 
regulation required for erythropoiesis. 
Mechanistically, mTORC1 signaling is enhanced 
to promote translation of mitochondria- 
associated transcripts through TOP-like motifs. 
Genetic and pharmacological perturbation 

of mitochondria or mTORC1 specifically 
impairs erythropoiesis in vitro and in vivo, and 
those ‘Protein-only’ genes, including TFAM, 
PHB2 and many of the genes related to
mitochondrial biogenesis, were significantly 
more enriched in TOP-like motifs compared 
to all genes. Hence, our results support a 
model (Figure 3D) in which the erythroid 
mitochondria are regulated through post- 
transcriptional mechanisms and may have 
direct relevance to hematological disorders 
associated with mitochondrial diseases and 


aging. 


Future Perspective 


As a future extension of these studies, it’s 
interesting to note that the high consistency 
observed between the protein expression 
changes during mESC differentiation and the 
translation changes of corresponding mRNAs 
detected from ribosomal profiling data gave 
us a useful hint about post-transcriptional 
regulation of gene expression in eukaryotic 
cells. For example, close to 85% of the top 500 
differentially expressed proteins detected in 
this process by using MAP to combine the 
proteomic profiling data generated in all three 
MS runs showed a highly consistent change 
in mRNA translation (selected based on the 
second best P-value among the three runs), 
and this fraction for the top 1000 proteins is 
still higher than 75%. Given the fact that these 
top proteins typically are highly abundant in 
the samples being compared, it’s reasonable 
to speculate that a large part of the post- 
transcriptional regulation of gene expression, 
especially for those highly expressed genes, 


is mediated by the sequence signatures 
encoded in RNA molecules, which can also be 
supported by the findings of our own studies 
as well as many other recent studies in this 
field. Thus, a systematic integrative analysis 

of proteomic, transcriptomic and ribosomal 
profiling data can provide a lot more 
information about the mechanism of post- 
transcriptional regulation of gene expression 
than separate analyses of these data.
III. Interaction between PRC2 and
Long Noncoding RNAs


Current State of Research 


Polycomb group (PcG) proteins, in particular Polycomb repressive complex 2
(PRC2), are important epigenetic regulators in development and disease. The
primary function of PRC2 is to deposit the histone mark H3K27me3 at specific
genomic regions, leading to the formation of heterochromatin and
transcriptional repression. In mammalian cells, despite the functional
importance of PcG proteins, the underlying mechanisms controlling their
site-specific chromatin recruitment remain incompletely understood. Since the
identification of XIST and HOTAIR, long non-coding RNA (lncRNA) mediated
recruitment of PRC2 has become a plausible, potentially sequence-dependent
mechanism for its target regulation. In 2009, a set of RNA
co-immunoprecipitation and chip hybridization (RIP-chip) experiments were
conducted to examine the expression and function of hundreds of lncRNAs in
three different human cell types. The study identified more than 200 lncRNAs
that can physically interact with the PRC2 core subunits EZH2 and SUZ12,
providing population-scale evidence of the PRC2-lncRNA interaction. However,
until now only a few large-scale RIP experiments have been published for PcG
proteins, which makes it extremely difficult to study the role of their
interactions with lncRNAs. More importantly, it remains under debate whether
PRC2 binds to RNA in a sequence-dependent manner, which may be largely due to
the high noise level of RIP-seq experiments.


In order to address these important questions, we carried out a systematic
analysis of the DNA sequence patterns associated with PRC2-binding lncRNAs in
both human and mouse genomes. In particular, we have developed a new
computational pipeline for analyzing the composition of long DNA and RNA
sequences of variable length using a Markov-chain based approach (Figure 4A).
It considers each sequence as a series of transitions between adjacent
nucleotides and uses the frequency of observing each possible transition to
characterize the composition of this sequence (Figure 4B). Through application
of this pipeline to the PRC2-binding and non-binding lncRNAs identified from
publicly available RIP data in human and mouse, we discovered a number of
transitions that are differentially favored by these two classes of lncRNAs,
which we take as the sequence features associated with PRC2-lncRNA
interactions. By mapping all possible transitions to a complete quad-tree
(Figure 4C-D), we found that a considerable fraction of the transitions favored
by PRC2-binding lncRNAs are located in consecutive paths (Figure 4E-F), and
that these transitions are more likely to also be favored by the mouse
PRC2-binding lncRNAs derived from the RIP-seq data of EZH2 generated from
mouse embryonic stem cells (Figure 4G-H).
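
As a rough illustration of the transition-frequency idea, the sketch below treats a transition of order k as a k-nucleotide prefix followed by one nucleotide, and takes its frequency to be its count divided by the number of windows of that length. This normalization, the function names and the handling of ambiguous bases are assumptions made for illustration; the published pipeline may define these quantities differently.

```python
from collections import Counter
from itertools import product


def transition_frequencies(seq, max_order=5, alphabet="ACGT"):
    """Frequency of every transition of order 0..max_order in one sequence.

    A transition of order k is written as a string of k + 1 nucleotides:
    a k-nucleotide prefix plus the nucleotide that follows it.
    """
    seq = seq.upper()
    freqs = {}
    for order in range(max_order + 1):
        k = order + 1
        windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        # Ignore windows containing characters outside the alphabet (e.g. N).
        windows = [w for w in windows if all(c in alphabet for c in w)]
        counts = Counter(windows)
        total = len(windows)
        for transition in map("".join, product(alphabet, repeat=k)):
            freqs[transition] = counts[transition] / total if total else 0.0
    return freqs


def quad_tree_children(prefix, alphabet="ACGT"):
    """The four transitions that share a prefix form one quad-tree building block."""
    return [prefix + n for n in alphabet]


example = transition_frequencies("ACGTACGT", max_order=1)
```

With orders 0 to 5 this yields 4 + 16 + ... + 4096 = 5,460 features per sequence, one for each node of a quad-tree of height 6.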


Figure 4. (A) Workflow of the sequence composition analysis pipeline. (B) Calculation of transition frequency,
which is defined as the frequency of observing a transition in the given sequence. (C) A building block of the quad-tree
comprised of 4 transitions with the same prefix. (D) The complete quad-tree of height 6 constituted by all possible
transitions of order 0-5. (E) A branch cut from the quad-tree shown in (D), which starts from level 3 and contains
two consecutively favored paths (CFPs). (F) Summary statistics of the CFPs observed in (D). (G) Fractions of different
groups of transitions that are identified as mouse PRC2-favored transitions. Here, the P-values were computed
by right-tailed Fisher's exact test based on hypergeometric distribution. (H) Boxplot of the AUC values of human
PRC2-favored transitions in predicting mouse PRC2-binding lncRNAs. Here the human PRC2-favored transitions
are divided into 2 groups based on whether or not they are located in CFPs, and the AUC value of a transition is
calculated by directly using its frequency in each sequence as the prediction score. (Adapted from Tu et al, Scientific
Reports 2017)


To make a more comprehensive assessment of the cross-species comparison of
PRC2-binding lncRNAs, we further incorporated a recently published EZH2
PAR-CLIP-seq dataset that was also generated in mouse embryonic stem cells,
and obtained 13,764 putative RNA-contact sites (RCSs) of EZH2 from this
dataset. Interestingly, almost half of the mouse PRC2-positive lncRNAs with
high cross-species prediction scores have EZH2 RCSs identified from the
PAR-CLIP-seq data, which is significantly higher than that of the
PRC2-positive lncRNAs with low prediction scores (29.0%, P = 0.003 by Fisher's
exact test), indicating they are more likely to be true PRC2-binding lncRNAs.
On the other hand, a considerable fraction of the mouse PRC2-negative lncRNAs
with high cross-species prediction scores were still found to contain EZH2
RCSs (29.5%), which is also significantly greater than that of the
PRC2-negative lncRNAs with low prediction scores (13.9%, P = 5E-5), implying
many of them may actually have the potential to physically interact with PRC2,
as predicted by their sequence similarity with the human PRC2-binding lncRNAs.
Inspired by these findings, we defined high-confidence mouse PRC2-positive
lncRNAs as the mouse PRC2-positive lncRNAs that also contain RCSs of EZH2, and
high-confidence mouse PRC2-negative lncRNAs as the mouse PRC2-negative lncRNAs
with no EZH2 RCS. By taking only these high-confidence lncRNAs into account,
we found the accuracy of cross-species prediction is even higher (AUC = 0.72,
Figure 5A), which strongly supports that a considerable proportion of the
sequence patterns associated with PRC2-lncRNA interactions are shared between
human and mouse lncRNAs.
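
The AUC values quoted here summarize how well a score separates PRC2-positive from PRC2-negative lncRNAs. A minimal sketch of that computation follows, using the pairwise (Mann-Whitney) definition of AUC; the scores are invented toy numbers, and the function name is hypothetical.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties contribute 0.5, which matches the area under the ROC curve.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy prediction scores for PRC2-positive and PRC2-negative lncRNAs.
positives = [0.9, 0.8, 0.4]
negatives = [0.5, 0.3, 0.2]
auc = auc_from_scores(positives, negatives)
```

The same function applies when the score is simply the frequency of a single transition in each sequence, which is how the per-transition AUC values in Figure 4H are described.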


We further investigated the distribution of the sequence features of
PRC2-binding lncRNAs along their gene bodies. Each human PRC2-positive lncRNA
was scanned by a sliding window of size 500 bp, and a local consistency score
was assigned to the sequence fragment in the window, calculated as the sum of
the frequencies of all PRC2-favored transitions in this fragment minus those
of all PRC2-disfavored ones. In this way, sequences with high consistency
scores should be highly enriched for PRC2-favored transitions and depleted of
PRC2-disfavored ones. Interestingly, these lncRNAs exhibit highly non-uniform
local consistency scores along their gene bodies, and some fragments have
clearly higher scores than others (Figure 5B). Inspired by this finding, we
defined the fragments with the highest and lowest consistency scores in each
PRC2-positive lncRNA as its PRC2-favored and PRC2-disfavored fragment,
respectively (Figure 5B). To assess whether they could be important for
PRC2-lncRNA interactions, we examined the RNA binding of PRC2 on these
fragments as well as their conservation level across vertebrate genomes
(Figure 5B). For the first analysis, we incorporated a recently published
fRIP-seq dataset of the PRC2 core subunits EZH2 and SUZ12 in the K562 human
leukemia cell line, and calculated the fRIP-seq read density at each
PRC2-favored and disfavored fragment. Interestingly, binding of EZH2 and SUZ12
at PRC2-favored fragments was found to be stronger than that at
PRC2-disfavored ones. Meanwhile, we also observed that PRC2-favored fragments
have significantly higher average conservation scores than PRC2-disfavored
ones (P = 1.7E-04 by paired t-test; Figure 5C). More explicitly, 30% of
PRC2-favored fragments overlap with conserved elements, which is significantly
higher than that of the 500 bp fragments randomly selected from the same
lncRNAs (P = 6E-04), and this fraction for PRC2-disfavored fragments is only
13% (P = 2E-04; Figure 5D). Taken together, these findings indicate that the
PRC2-favored fragments, which are highly enriched with sequence features
associated with PRC2-lncRNA interactions, are generally more conserved than
the other parts of the lncRNAs they belong to, and thus are more likely to be
of functional importance.

In addition, we also found that our new sequence composition analysis pipeline
performed better than the traditional K-mer based method in predicting
PRC2-binding lncRNAs, especially for the extremely long ones (Figure 5E-F).


Figure 5. (A) ROC curve and corresponding AUC value of the prediction model trained with human lncRNAs in predicting
the mouse PRC2-positive lncRNAs derived from the RIP-seq (blue) and PAR-CLIP-seq (green) data, as well as the high-confidence
mouse PRC2-positive ones defined by combining these two datasets (red). (B) A representative PRC2-positive
lncRNA locus. Here its PRC2-favored and disfavored fragments are indicated by the red and blue bars, respectively, and the
red tracks in the middle show the fRIP-seq read counts of EZH2 and SUZ12 in the human K562 cell line. (C) Boxplot of the average
PhastCons conservation scores of the PRC2-favored and disfavored fragments identified from human PRC2-binding
lncRNAs. (D) Distribution of the fraction of the 500 bp fragments randomly selected from human PRC2-binding lncRNAs that
overlap with the conserved elements. Here the distribution was drawn from 10° times of random sampling and dashed lines
represent the fraction of PRC2-favored/disfavored fragments that overlap with the conserved elements. (E-F) AUC values
of the prediction models based on transition (red bars) or K-mer (blue bars) frequencies, which were trained and tested by
human (E) and mouse (F) lncRNAs, respectively. Here all the human/mouse PRC2-positive and PRC2-negative lncRNAs were
separately divided into two subgroups of equal size according to their length, termed the moderately and extremely long
ones, to assess the model performance on lncRNAs of different length. (Adapted from Tu et al, Scientific Reports 2017)


Future Perspective


An interesting aspect of our findings is that the sequence pattern of lncRNAs
can be highly complex along their gene bodies, leading to the hypothesis that
such complexity might be necessary for their functions. For example, we have
recognized a set of fragments that are highly enriched with the sequence
features associated with PRC2-lncRNA interactions, and found that these
fragments are significantly more conserved than the other parts of these
lncRNAs, implying they may be important for the function of these lncRNAs.
This observation provides a different viewpoint for understanding the low
conservation level of lncRNAs in mammalian genomes, and implies that
evolutionary analysis can still serve as a useful tool for identifying
functional elements of lncRNAs. Taken together, our findings indicate that,
although the sequences of lncRNAs are of tremendous complexity, they still
share quite a number of recurring patterns. Using these patterns as clues,
prediction based on global and local sequence patterns can serve as a useful
guide for experimental biologists to investigate their functions. For future
studies, even more sophisticated models, e.g. a nonhomogeneous Markov model,
may be employed to further understand the heterogeneous sequence composition
of lncRNAs.
by deregulating splicing. Sci Rep, 2017, 7,
40488; doi: 10.1038/srep40488.


+ Chen Y, Breeze CE, Zhen S, Beck S,
Teschendorff AE#. Tissue-independent and
tissue-specific patterns of DNA methylation
alteration in cancer. Epigenetics Chromatin,
2016, 9 (1), 10.


+ Huang J*, Liu X*, Li D*, Shao Z*, Cao H,
Zhang YN, Trompouki E, Bowman TV,
Zon LI, Yuan GC, Orkin SH#, Xu J*.
Dynamic Control of Enhancer Repertoires
Drives Lineage and Stage-Specific
Transcription during Hematopoiesis. Dev
Cell, 2016, 36, 9-23.


+ Das PP*, Hendrix DA*, Apostolou E, Buchner
AH, Canver MC, Beyaz S, Ljuboja D, Kuintzle
R, Kim W, Karnik R, Shao Z, Xie H, Xu J, De
Los Angeles A, Zhang Y, Choe J, Jun DL,
Shen X, Gregory RI, Daley GQ, Meissner A,
Kellis M, Hochedlinger K, Kim J, Orkin SH#.
PRC2 Is Required to Maintain Expression
of the Maternal Gtl2-Rian-Mirg Locus by
Preventing De Novo DNA Methylation
in Mouse Embryonic Stem Cells. Cell Rep,
2015, 12 (9), 1456-1470.


+ Xu J*, Shao Z*, Li D, Xie H, Kim W, Huang
J, Taylor JE, Pinello L, Glass K, Jaffe JD, Yuan
GC, Orkin SH#. Developmental control of
Polycomb subunit composition by GATA
factors mediates a switch to non-canonical
functions. Mol Cell, 2015, 57 (2), 304-316.


Cooperation 


- Epigenetic regulation in Cancers. Prof.
Jian Xu, Children’s Medical Center Research
Institute, Department of Pediatrics,
University of Texas Southwestern Medical
Center, Dallas, TX, US.


- Regulatory network of embryonic stem
cells. Prof. Qiurong Ding, Institute of
Nutritional Sciences, SIBS, CAS, Shanghai,
China.


External Funding 


- Active:


Hundred Talents Program
Y516C11851, Chinese Academy of Sciences, China
Principal Investigator: Zhen Shao
07/01/15 - 12/30/17


Cross-discipline Collaborative Team for
“Omics of Immune Cells in Auto-immune Diseases”
Y646C11851, Chinese Academy of Sciences, China
Principal Investigator: Xiaoming Zhang
(Zhen Shao as a participant)
07/01/14 - 06/30/17


- Completed:


“Quantitative Comparison and Integrative
Analysis of Epigenomic Data”
14PJ1410000, Shanghai Science and
Technology Commission, China
Principal Investigator: Zhen Shao
07/01/14 - 06/30/16


Teaching 


+ Bioinformatics and Algorithms, course for
1st year graduate students at the Shanghai
branch of the CAS graduate school,
together with Professors from PICB (3
lectures: “Probability”, “Analysis of sequence
features” and “Integrative analysis of
multi-omic data”).


Invited Talks 


+ 08/2016 Youth Bioinformatic PI Forum, 
Beijing, China 
4.7 Computational Systems Genomics


Researchers:
Dr. Andrew Teschendorff (Principal Investigator)
Phone: +86 021 54920659
Email: andrew@picb.ac.cn

Staff:
Dr. Zhen Yang

Students:
Gao Yang
Tian Yuan
Chen Yuting
Zheng Shijie


Research


Overview


The Computational Systems Genomics group was established in September 2013 and
is headed by Andrew Teschendorff, who joined as PI from the UCL Cancer
Institute, University College London, UK.


Our research group is purely computational/in-silico based. We develop and
apply novel statistical methods to help analyze and interpret complex
multi-dimensional omic data, with a special focus on Cancer Systems
Epigenomics. Novel data is analyzed via our collaborations with leading
clinicians and biologists. As such our research goals are strongly aligned
with those of P4 (preventive, participatory, personalized and predictive)
Medicine. Specifically, our research is aimed at (i) identifying biologically
and clinically relevant biomarkers for the early detection and risk prediction
of common cancers, (ii) defining clinically relevant cancer taxonomies for
personalized medicine, (iii) elucidating the role of epigenetics in ageing and
cancer, and (iv) improving our understanding of the systems-level principles
underlying cancer.


Systems Epigenomics of Ageing and Cancer


Current State of Research


Previous studies performed by us and other groups have demonstrated the
importance of DNA methylation (DNAm) changes in aging and the earliest stages
of carcinogenesis (Teschendorff et al Genome Res. 2010 & Genome Med. 2012).
DNA methylation changes in particular have been shown to measure chronological
age remarkably well, exemplified by Horvath’s epigenetic clock. Furthermore,
deviations between the clock's age estimate and chronological age have been
proposed as biomarkers of biological age. However, as recently reviewed by us
(Zheng SC et al Epigenomics 2016), the relevance of Horvath’s clock for cancer
and cancer risk prediction is unclear. Partly, this is because Horvath’s clock
is not a mitotic clock. Instead,
we recently hypothesized and demonstrated that an epigenetic mitotic clock,
which approximates the relative number of stem-cell divisions in a given
tissue across a number of different individuals, could serve as a useful
cancer risk stratification tool (Teschendorff AE et al Genome Biol. 2016).
Specifically, we were able to construct an epigenetic clock (called “epiTOC”)
which correlated with the estimated lifetime number of stem-cell divisions
across many different normal tissue types, and which, unlike Horvath’s clock,
predicts universal age-acceleration in cancer and precancerous lesions
(Figure-1, Teschendorff AE et al Genome Biol. 2016). Moreover, we showed how
epiTOC's tick rate (unlike that of Horvath’s clock) appears accelerated in
normal cells exposed to a major carcinogen (smoking). Our work shows how DNAm
at specific CpG sites that start out unmethylated across a large number of
fetal tissue types accrues with age in line with the stem-cell division rate
of the tissue, and how this rate is further modulated by cancer risk factors.
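
To make the idea of a mitotic clock score concrete, the sketch below assumes that the score of a sample is simply the average DNAm (beta value) over a fixed set of CpG sites that are unmethylated in fetal tissue, so that the score rises as methylation accrues. This averaging rule, the CpG identifiers and all beta values are illustrative assumptions rather than the published epiTOC specification.

```python
def mitotic_clock_score(beta_values, clock_cpgs):
    """Average DNAm over the clock CpGs that were measured in this sample.

    beta_values maps CpG identifier -> beta value in [0, 1].
    """
    measured = [beta_values[c] for c in clock_cpgs if c in beta_values]
    if not measured:
        raise ValueError("none of the clock CpGs were measured in this sample")
    return sum(measured) / len(measured)


# Hypothetical CpG identifiers and beta values, for illustration only.
clock_cpgs = ["cg_a", "cg_b"]
fetal_sample = {"cg_a": 0.02, "cg_b": 0.04}
adult_sample = {"cg_a": 0.10, "cg_b": 0.20, "cg_other": 0.90}

fetal_score = mitotic_clock_score(fetal_sample, clock_cpgs)
adult_score = mitotic_clock_score(adult_sample, clock_cpgs)
```

Under this assumption a tissue with more accumulated stem-cell divisions would show a higher score than fetal tissue, which is the qualitative behaviour described in the text.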
Figure-1. An epigenetic mitotic clock for cancer risk prediction. A) Correlation of the tick-rate of an epigenetic mitotic
clock (pcgtAge, y-axis) with the estimated total number of lifetime cell divisions per stem-cell (TNSC, x-axis) for normal 
samples of different tissue types. B) Lack of a corresponding correlation for Horvath’s epigenetic age-acceleration 
measure, confirming that Horvath’s clock is not a mitotic clock. C & E) Association of the epigenetic mitotic clock 
with in-situ carcinomas and invasive cancer for lung and breast, with corresponding ROCs showing discrimination 
accuracy between normal and in-situ carcinomas. D & F) Corresponding lack of a positive correlation for Horvath’s 


age-acceleration measure. 




Recently, we also completed the first 
Epigenome-Wide Association Study (EWAS) 
in buccal squamous epithelial cells, a source 
of tissue which may serve as an excellent 
surrogate for a wide range of epithelial 
cancers. Our EWAS in over 790 samples 


demonstrated that buccal cells acquire 
substantial DNAm changes that correlate 

with lifetime smoking exposure (Teschendorff 
AE et al JAMA Oncol. 2015). We showed that 
buccal cells acquire substantially more DNAm 
alterations than blood cells from the same 
individual, in line with the fact that buccal 
cells are directly exposed to the carcinogen. 
Importantly, we showed that a smoking DNAm 
signature derived from the buccal samples 
could discriminate cancer from normal tissue 
with a very high AUC (>0.95), irrespective 

of tissue-type, suggesting that there is a 
common biological process underpinning 
DNAm alterations in cancer and those seen in 
normal cells exposed to a major carcinogen. 
This common biological process is likely to be 
an increased cell-division rate, in line with our 


epigenetic mitotic clock work. Further support
comes from two other studies published by
our group, where we show that on average
60% of the aberrant DNAm landscape of one 
tumour can be explained by the aberrant 
DNAm profile of another tumour, even if from 
a different cancer-type (Chen Y et al Epigenetics 
& Chromatin 2016), as well as another study 
where we show that all ER+ breast cancer 
types carry the same DNAm alterations, with 
the only difference being the level of DNAm 
deregulation which increases in line with a 
breast tumour’s proliferation rate (Gao Y et 

al Clin. Epigenetics 2015). Most importantly, 

the smoking-associated DNAm signature 
derived in buccal cells, when assessed in lung 


carcinomas in-situ (LCIS), allowed stratification 
of the LCISs into those that progress into an 
invasive lung cancer (ILC) and those that do 
not (AUC=0.88, 95% CI: 0.76-1.00). If validated
in a larger cohort, and if the same result could 
be obtained using buccal samples from LCIS 
patients, this would meet one of the biggest 
unmet challenges in the field of lung cancer: 
namely, the provision of a non-invasive test 
for the reliable identification of LCIS patients at 
highest risk of developing an ILC. 


Importantly, our smoking EWAS in buccal 
samples also provided deep novel insights 
into the potential role of DNAm changes in 
carcinogenesis, insights which we note were 
not obtained in previous smoking EWAS 
conducted in blood. Our analysis showed
that most of the smoking-associated DNA
hypermethylation in buccal cells targets
binding sites of key transcription factors,
including RAD21, a DNA repair enzyme, and
CTCF, a transcription factor with a key role in
specifying chromatin architecture (Figure-2).
Of note, our analysis also showed that
smoking-associated DNA hypermethylation occurs at
promoters of developmental transcription
factors. Intriguingly, we have observed
similar enrichment of DNA hypermethylation 
at developmental TF promoters and CTCF 
binding sites in the context of aging (Yuan T 
et al PLoS Genetics 2015) and in normal tissue 
at risk of breast cancer (Teschendorff et al Nat 
Commun. 2016), suggesting a universal pattern 
of DNAm alteration in response to different 
cancer risk factors. 
Figure-2. DNA methylation heatmap (blue/yellow denotes relative hyper/hypo-methylation) of 400 buccal samples
making up the discovery set, with samples ordered according to a smoking-index measuring the absolute deviation
in DNAm from the normal DNAm levels found in non-smokers. The right panel depicts whether the corresponding CpGs map to
specific biological terms, including transcription-factor binding sites for 3 highly enriched transcription factors.
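
One simple way to read the smoking-index in the caption is as the mean absolute deviation of a sample's DNAm from the average non-smoker profile over a set of signature CpGs. The sketch below implements that reading only; the absence of any standardization, the function names, the CpG identifiers and the beta values are all assumptions made for illustration.

```python
def reference_means(non_smoker_samples, cpgs):
    """Average beta value per CpG across the non-smoker samples."""
    return {
        c: sum(s[c] for s in non_smoker_samples) / len(non_smoker_samples)
        for c in cpgs
    }


def smoking_index(sample, ref_means):
    """Mean absolute deviation of a sample from the non-smoker reference."""
    deviations = [abs(sample[c] - mu) for c, mu in ref_means.items()]
    return sum(deviations) / len(deviations)


# Hypothetical CpGs and beta values.
cpgs = ["cg1", "cg2"]
non_smokers = [{"cg1": 0.2, "cg2": 0.8}, {"cg1": 0.4, "cg2": 0.6}]
ref = reference_means(non_smokers, cpgs)

index_smoker = smoking_index({"cg1": 0.5, "cg2": 0.5}, ref)
index_control = smoking_index({"cg1": 0.3, "cg2": 0.7}, ref)
```

Sorting samples by such an index would place profiles close to the non-smoker average at one end and strongly deviating profiles at the other, as in the ordering of the heatmap.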


Another line of investigation we have pursued 
is the integration of DNAm with gene 
expression with the aim of identifying putative 
functional drivers of the carcinogenic process. 
Our previous FEM (Functional Epigenetic
Modules) algorithm, which integrated
Illumina Infinium 27k DNA methylation and
gene expression data at a systems-level, 

and which was very successful in identifying 
an epigenetic driver of endometrial cancer 
(HAND2 gene, Jones et al PLoS Med. 2013), has 
now been extended to include Illumina 450k 
(Jiao Y et al Bioinformatics 2014) and EPIC 850k 
data (unpublished). 


In the context of aging, we performed one of
the first integrative DNAm gene-expression
analyses in whole blood, demonstrating that
a substantial part of epigenetic drift is not 
driven by changes in cell-type composition 


and that most of the age-associated DNAm 
alterations that occur in promoters do not 
result in in-cis gene expression changes, 
because the direction of DNAm alteration acts 
to stabilize gene expression levels (Yuan T et al 
PLoS Genetics 2015). 


In the context of cancer, we have recently 
performed four integrative pan-cancer wide 
analyses using matched multi-dimensional 
TCGA data. One study integrated gene 
expression data from human embryonic 
stem cells, fetal tissue and adult normal 
cells, with DNAm, gene expression, copy- 
number and mutational data for a number 
of matching cancer-types from the TCGA, 
in order to evaluate the significance of 
transcription factor silencing events in 
carcinogenesis (Teschendorff et al Genome 
Med. 2016). Importantly, this study revealed 
Figure-3. A) Pan-cancer wide correlation heatmaps between expression of epigenetic enzymes (EE) and global

aberrant DNA methylation indices (separately for aberrant hypermethylation-HyperZ and for aberrant hypomethylation- 
HypoZ). EEs exhibiting significant and consistent correlative patterns across all cancer-types are indicated in red for those 
consistently overexpressed in cancer, and in green for those consistently underexpressed. B) Expression correlation network 
of the EE genes selected in A), identifying a core module and independent regulators of the cancer DNAm landscape. 


that transcription factors which are 

important for the specification of a tissue- 
type are preferentially underexpressed in 

the corresponding cancer-type, and that this 
underexpression is preferentially associated 
with promoter DNA hypermethylation and not 
with genomic loss or inactivating mutations. 
This demonstrates that TF silencing is an 
important event in carcinogenesis and further 
highlights the potentially important role of 
DNAm alterations in early carcinogenesis, 

as these specific DNAm alterations are seen 

in normal cells as a function of age and 


other cancer risk factors. In a second pan- 
cancer wide study using TCGA data, we 
performed an integrative analysis of DNAm 
and gene expression focusing on a large 
class of epigenetic enzymes, with the aim 

of identifying putative epigenetic master 
regulators of cancer (Yang Z et al Genome Biol. 
2015). Our study successfully demonstrated the 
existence of universal patterns of epigenetic 
deregulation that transcend cancer-type, and 
pinpointed a handful of epigenetic enzymes 
which appear to drive the aberrant DNAm 
landscape in cancer (Figure-3). 
In a third pan-cancer wide TCGA study, we
have integrated DNAm data with independent 
chromatin-state data from IHEC and ENCODE, 
to discern the molecular rules governing the 
aberrant DNAm patterns in cancer (Chen Y et 
al Epigenetics & Chromatin 2016). Importantly, 
our data shows that the H3K27me3 mark 

in normal cells is the most predictive mark 


of promoter hypermethylation in the 
corresponding cancer type, a result which 

has been shown to be valid across several 
different cancer types. Moreover, we show 
that H3K4me3 signal in normal cells reduces 
the likelihood of promoter hypermethylation 
in cancer. Finally, our fourth pan-cancer wide
TCGA study has mapped putative epigenetic 
and copy-number cancer drivers onto a 
human protein interaction network, in order 
to determine whether such driver alterations 
exhibit different systems-level properties (Gao 
Y et al Nucleic Acids Res. 2016). Importantly, 

this study has shown that putative functional 
epigenetic drivers map preferentially onto the 
extracellular and transmembrane domains 

of signaling pathways, in stark contrast with 
functional copy-number or mutational events, 
which tend to occur in the intra-cellular 
pathway domain. Our work further pinpointed 
key signaling pathways (WNT and chemokine 
signaling) which are prone to epigenetic 
deregulation in their extracellular domains. 


Future Perspective 


In the foreseeable future, DNA methylation will
remain the only epigenetic mark that can
be reliably measured genome-wide in large
numbers of clinical samples. Thus, it will remain
the epigenetic marker of choice for cancer
epigenome (e.g. TCGA/ICGC) and Epigenome-Wide
Association studies (EWAS). However, it is
clear that the interpretation of studies measuring
only DNAm is limited. There is an increasing
need for integrative analyses which combine
DNAm with matched gene expression data
or with unmatched ChIP-Seq and chromatin
data from international consortia such as 


IHEC and Blueprint. Thus, our future work in 
Systems Epigenomics aims to contribute novel 
statistical and bioinformatic methods for such 
integrative analysis with two ultimate goals in 
mind: (i) identification of cancer biomarkers 
and epigenetic aberrations predisposing or 
causing cancer, (ii) an improved systems- 

level understanding of how the epigenome is 
altered in ageing and cancer. 


+ Systems-level integration of DNA methylation, 
gene expression and chromatin network 
data: We are in the process of developing 
an algorithm which integrates a cell-type 
specific regulatory network, constructed 
using cell-type specific open chromatin 
(DHS) from IHEC/ENCODE and TF-PWM 
and/or ChIP-Seq data, with Illumina 450k/ 
EPIC DNAm and matched gene expression 
data, to identify the regulatory subnetworks 
which are disrupted in cancer. We will apply 
this algorithm to lung squamous epithelial 
cells and lung squamous cell carcinoma 
(LSCC), in addition to our EWAS on buccal 
squamous epithelium in order to identify 
putative smoking-associated DNAm 
changes which drive the development of 
LSCC. 


+ Epigenetic aging and cancer: Substantial 
progress has been made with the 
development of epiTOC (an epigenetic 
mitotic clock) and demonstrating its 
different properties to Horvath's epigenetic
clock. However, the usefulness of our 
epigenetic mitotic clock for cancer risk 
prediction needs to be further assessed, 
and ideally in the context of prospective 
studies. We will do so with collaborative 
partners at UCL London and elsewhere, 

as part of the Horizon 2020 FORECEE 
program. A separate outstanding 

question concerning age-associated 
epigenetic alterations is the identification 
of alterations that have functional impact 
and, in particular, to study if age-associated 
DNAm changes which target tissue- 


specific transcription factors are able to 
predict the prospective risk of neoplastic 
transformation, in line with our overarching 
hypothesis that blocks to differentiation are 


an early cancer-predisposing event. 


+ Epigenetic landscapes in cellular development, 
aging and cancer: One of the most exciting 
possibilities offered by multi-dimensional 
omic data, as generated by IHEC, is the 
construction of “epigenetic landscapes”,
which aim to build predictive models 

for understanding cell-fates and cell-fate 
transitions. Integrating matched histone 
mark, gene expression and DNAm data 

for different cell-types offers a means of 
building approximations to these energy 
potential landscapes. By then studying how 
these activity patterns are altered in aging 
and cancer we hope to unravel system- 
level principles underlying these complex 
phenotypes. 


Intra-sample cell-type 
heterogeneity data: novel 
algorithms for cell-type 
deconvolution 


Current State of Research 


Confounding factors pose a major statistical 
challenge for the reliable inference of 
biomarkers from large omic studies. One 
confounding factor that has received 
particular attention in the context of gene 
expression and epigenome studies is intra- 
sample cell-type heterogeneity. Not adjusting 
for cell-type heterogeneity can dramatically
affect estimates of statistical significance
and obscure potential causal
associations with disease. Currently, there are
two main statistical inference paradigms being 
considered. Reference-based methods use 
defined molecular profiles of representative 
cell-types as a reference, to then project 
sample profiles onto these reference profiles. 
This procedure infers cell-type fractions in 
each sample. Hence, these approaches require 
knowledge of the main underlying cell-types 
in the tissue of interest, as well as molecular 
profiles representing these cell-types. In 
contrast, reference-free methods do not 
require reference profiles and as such are, in 
principle, applicable to any tissue type. 
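
The reference-based idea can be illustrated with a minimal sketch. It assumes a sample profile is a weighted mixture of known reference profiles, and it estimates the weights by least squares under the constraints that fractions are non-negative and sum to one. This is a generic constrained least-squares fit written for illustration only; it is not the implementation of any published deconvolution method, and the reference values, function names and iteration count are hypothetical.

```python
from typing import List


def project_to_simplex(v: List[float]) -> List[float]:
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}."""
    theta = 0.0
    cumulative = 0.0
    for k, u_k in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def estimate_fractions(sample: List[float],
                       reference: List[List[float]],
                       n_iter: int = 500) -> List[float]:
    """Projected gradient descent for sample ~ sum_c w_c * reference[c]."""
    n_types, n_features = len(reference), len(sample)
    # The squared Frobenius norm bounds the largest eigenvalue of the Gram
    # matrix, so its inverse is a safe gradient step size.
    bound = sum(x * x for profile in reference for x in profile)
    step = 1.0 / bound
    w = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [
            sum(w[c] * reference[c][i] for c in range(n_types)) - sample[i]
            for i in range(n_features)
        ]
        gradient = [
            sum(reference[c][i] * residual[i] for i in range(n_features))
            for c in range(n_types)
        ]
        w = project_to_simplex(
            [w[c] - step * gradient[c] for c in range(n_types)]
        )
    return w


# Hypothetical reference DNAm profiles (3 cell types x 6 CpGs).
reference = [
    [0.9, 0.1, 0.1, 0.8, 0.2, 0.1],
    [0.1, 0.9, 0.1, 0.2, 0.8, 0.1],
    [0.1, 0.1, 0.9, 0.1, 0.1, 0.9],
]
true_fractions = [0.6, 0.3, 0.1]
mixed = [
    sum(f * profile[i] for f, profile in zip(true_fractions, reference))
    for i in range(6)
]
estimated = estimate_fractions(mixed, reference)
```

In this noise-free toy case the estimated fractions match the mixing proportions. With real data the quality of the estimate depends on how well the reference profiles represent the cell types actually present in the tissue.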


Given the critical importance of adjusting 

for intra-sample cell-type heterogeneity, 

it is of paramount importance to compare 
and evaluate different statistical cell-type 
deconvolution methods, as comprehensive 
and unbiased comparisons have not yet been 
performed. We have recently conducted a 
comprehensive comparison of reference-free 
to reference-based methods, in the context of
Epigenome-Wide Association Studies (EWAS), 
including studies profiling DNA methylomes of 
cancer tissue (Zheng SC et al Nat Methods 2017 
In Press). Our analysis has led to the following 
novel insights: (i) First, for tissues where the 
underlying cell-type composition is reasonably 
well known (e.g. whole blood), reference- 
based methods are generally preferable as a 
means of adjusting for cell-type heterogeneity. 
(ii) Among reference-based methods, we 

find that Houseman’s constrained projection 
algorithm does not always perform optimally, 
and that non-constrained approaches based 
on Support Vector Regression and Robust 
Partial Correlations are preferable. (iii) Among 
reference-free methods, we find that previous 
supervised methods such as Surrogate 
Variable Analysis or Independent Surrogate 
Variable Analysis, which were developed for 
dealing with unknown general confounders, 
obtain the best compromise between 
sensitivity and specificity, as well as being the 
most robust. 


Future Perspective 


Over the next five years, cell-type 
deconvolution will remain one of
the most important statistical issues to 
address. Single-cell analysis remains very 
costly and limited to the profiling of only a 
relatively small proportion of cells. Current 


and future Epigenome-Wide Association 
Studies, as well as the International Cancer 
Genome Consortium (ICGC), are profiling 
thousands of samples derived from complex 
tissues. Moreover, although many cell- 

type deconvolution algorithms exist, and 
despite the insights provided by our recent 


comparative study (Zheng SC et al Nat Methods 
2017), it is still unclear which methods are the 
most reliable, as this may depend critically on 
specific features of a study such as tissue type 
and phenotype of interest. Addressing these 
key issues requires comparative analysis in 
significant numbers of independent datasets, 
for instance, more gene expression or DNA 
methylation datasets with matched flow- 
cytometric estimates. 


Besides scaling up analyses, there are also 
many other unresolved statistical challenges. 
One challenge is the identification of 
molecular alterations which occur in specific 
cell-types, since all current algorithms only 
allow inference of molecular alterations 


which are independent of changes in cell- 
type composition without informing on 

the specific cell-types which carry such 
alterations. Identifying which specific cell- 
types carry the molecular alteration (e.g. a 
DNAm change or mRNA expression change) 


is key in order to identify potential causal 
alterations. For instance, being able to identify 
DNAm changes that occur in specific T-cell 
subsets may be critical for understanding 


the effectiveness of immune responses in
cancer. Thus, one of our future goals will 

be to develop a novel statistical cell-type 
deconvolution approach which will allow us 
to infer molecular alterations in individual cell- 
types. 


In addition, we are also currently 

exploring improved methods for cell-type 
deconvolution. One method is based on the 
concept of “semi-reference-free” inference, 
which uses a specific level of prior biological 
information in the inference procedure 
but does not impose the rigidity of a full
reference-based approach. Since both gene 
expression and DNA methylation are regulated 
at a signaling pathway level, we are also 
currently exploring semi-reference-free graph 
deconvolution approaches which would first 
incorporate pathway-level information as prior 
biological knowledge in the form of a network 
and which would subsequently perform 
cell-type deconvolution via blind source 
separation on this network. 


Single-cell RNA-Seq analysis also provides 

an interesting arena in which to assess cell- 
type deconvolution methods. With such 
data, the impact of cell-to-cell heterogeneity 
on inference can be evaluated, as well as the 
potential impact of inter-cellular interactions, 


which may alter the in-vivo molecular profiles 
in ways which reference profiles are not
able to capture. We plan to study how big a 
limitation cell-cell interactions may present to 
cell-type deconvolution for reference-based 
approaches in comparison to reference-free 


and semi-reference-free methods. 


Statistical Methods for Preventive 
and Personalized Medicine 


Current State of Research 


Cancer incidence is set to increase and 

to present a major economic and social 
burden to modern society. Two key areas 
where cancer management needs urgent 
improvement are (i) risk prediction & early 
detection, and (ii) personalized medicine. Risk 
prediction biomarkers may offer preventive 


strategies, whilst novel early detection tools 


are key for improving survival rates. For those
cancers for which risk prediction or early
detection is hard, personalized targeted
treatments are important to improve survival
and to reduce the potentially harmful effects
of aggressive chemotherapy. However, many
statistical challenges have emerged which
need to be overcome in order for preventive 
and personalized medicine to realize its full 
potential. 


One statistical challenge we have 
encountered relates to the nature of the 
epigenetic changes seen in the earliest 
stages of carcinogenesis. In a previous study 
conducted on liquid-based cytology cervical 
smear samples (Teschendorff et al Genome 
Med. 2012), we demonstrated, using a novel 
statistical algorithm called EVORA (Epigenetic 
Variable Outlier for Risk prediction Analysis), 
that DNA methylation changes measured in 
cytologically normal epithelial cells, 3 years 

in advance of neoplastic transformation, can 
predict the risk of such transformation. We 
observed that DNA methylation alterations 
occurring in normal epithelial cells at risk of 
neoplastic transformation appear to occur 

in a stochastic fashion across independent 
individuals, although they are distinctively 
non-random with regards to the genomic 
loci where the alterations are found. As 

a result of the stochastic nature of these 
epigenetic alterations, we demonstrated that 
an entirely different statistical paradigm for 
feature selection is needed, one based on 
the notion of differential variability (DV). DV 
provides a framework in which to identify 
DNA methylation outliers, which we posit 
mark normal cells that are at risk of neoplastic 
transformation. Confirming our hypothesis, 
we recently conducted a similar study in
the context of breast cancer (Figure-4,
Teschendorff et al Nat. Commun. 2016).

Identification of Epigenetic Field Defects using the iEVORA algorithm
Figure-4. A, C & D) Flowchart of the procedure used to identify and validate DNAm field defects in breast cancer,
using the iEVORA algorithm. B) Illustration of the difference between performing feature selection based on the
common paradigm of differential mean methylation, or the novel paradigm based on differential variability.
Figure-5. A) Comparative histograms of P-values testing for differential variability and differential means between normal
breast and normal-adjacent tissue. B) An example of a differentially variable CpG (mapping to FZD2) between normal and
normal-adjacent breast tissue. C) Heatmap of DNAm values for the 42 normal-adjacent samples across the differentially
variable CpGs identified using iEVORA. Samples ranked according to the fraction of field-defects (FD) they carry. D)
Validation of the power of the identified DNAm field defects to discriminate normal (N) from normal-adjacent (NADJ) breast
tissue in an independent dataset.


In this study, the normal cells at risk of 
neoplastic transformation were sampled 
from normal tissue found adjacent to breast 
cancers. By comparing normal breast tissue 
from healthy women to normal breast tissue 
from age-matched women with breast 
cancer, and using iEVORA, we were able to
demonstrate and validate the existence of 
widespread DNAm field defects (Figure-5, 
Teschendorff et al Nat. Commun. 2016). 
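
The contrast between differential means and differential variability can be made concrete with a small sketch. It compares a Welch t-statistic with a simple ratio of sample variances at a single CpG where a few "at risk" samples carry outlier methylation values. The data are invented for illustration, and the two statistics are generic stand-ins; this is not the EVORA or iEVORA procedure.

```python
import math
import statistics
from typing import List


def welch_t(a: List[float], b: List[float]) -> float:
    """Welch's t-statistic for a difference in means between two groups."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def variance_ratio(a: List[float], b: List[float]) -> float:
    """Ratio of sample variances (group a over group b)."""
    return statistics.variance(a) / statistics.variance(b)


# Hypothetical DNAm beta values at one CpG. Two of the ten "at risk"
# samples are outliers; the remaining eight look like the normal group.
normal = [0.10, 0.12, 0.09, 0.11, 0.10, 0.12, 0.08, 0.11, 0.10, 0.09]
at_risk = [0.10, 0.11, 0.09, 0.12, 0.10, 0.55, 0.09, 0.11, 0.62, 0.10]

t_stat = welch_t(at_risk, normal)
ratio = variance_ratio(at_risk, normal)
```

In this toy example the variance in the at-risk group is more than a hundred times larger than in the normal group, while the t-statistic stays below 2. A feature of this kind would be ranked low by a test on means but stands out clearly under a variability-based criterion.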


Given the importance of DV as a novel feature 
selection paradigm, there has been interest 

in the community to develop novel improved 
DV algorithms. Of particular importance is the 
development of DV algorithms that offer a 
reasonable compromise between sensitivity 
and the type-1 error-rate. Indeed, following 
our original study in 2012, a number of other 


DV algorithms were proposed, which offer 
improved control of the type-1 error rate. 
However, we have very recently demonstrated 
that these other algorithms lack the sensitivity 
to detect DNAm field defects in normal cells 
at risk of neoplastic transformation, both in 
the context of cervical and breast cancer 
(Teschendorff et al BMC Bioinformatics 2016).
Thus, currently, EVORA, and an improved 
version called iEVORA (Teschendorff et al Nat. 
Commun. 2016) are the only algorithms which 


have been conclusively shown to identify 
candidate epigenetic field defects in cancer. 


More recently, as part of a collaboration 
with Blueprint and the International Human 
Epigenome Consortium (IHEC), we have 


been exploring the concept of DV in the 
context of an EWAS for type-1 diabetes (Paul 
DS, Teschendorff AE et al Nat. Commun. 2016). 
Unique to the study was the use of a large 
number of monozygotic twins discordant for 
disease status, as well as the DNAm profiling 
in 3 different immune-effector cell-types 
(CD4+ T-cells, CD19+ B-cells and CD14+ 
Monocytes) for each individual twin pair. 

We demonstrated that, while there were no
differentially methylated positions at genome-
wide significance, iEVORA could pick out
DNAm outliers which occurred predominantly
in the type-1 diabetic twins. These DNAm
alterations have been shown to target specific
metabolic pathways and support the view


that using DV as a novel feature selection 
paradigm may also be relevant in diseases 
beyond cancer. 


We have also pursued alternative systems- 
level methods for the identification of cancer 
risk biomarkers in DNA methylation studies. 
Viewing the transition from a normal “at risk 
of cancer” state to an invasive cancer stage 
as a critical phase transition, it is possible 


to formulate a mathematical model with 
underlying universal laws which makes 


specific predictions as to the behavior of 
macroscopic observables as the critical 
phase transition point is approached. One 

of these laws predicts an increased variance 
and covariance of specific molecular features 
(in our case these will be DNAm patterns at 
specific CpG sites), which can be measured 
from longitudinal time course data or from 
cross-sectional population cohort data. Given 
that time-course data of cancer progression 
in the tissue of origin is impossible to acquire, 
our effort has focused on using cross-sectional 
population cohort DNA methylation data
representing three major disease stages
(normal no-risk, pre-cancer and invasive
cancer). Importantly, we demonstrated that
the underlying phase transition model of
disease progression can successfully identify
cancer risk CpGs in the context of cervical
carcinogenesis (Teschendorff et al PLoS Comp.
Biol. 2014). Risk CpG modules obtained using
this approach have been dubbed Dynamic 
Network Biomarkers (DNBs). 
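
The prediction of increased variance and covariance near a critical transition can be sketched as follows. For a candidate CpG module, the code summarizes each disease stage by the mean per-CpG variance and the mean absolute pairwise Pearson correlation across samples. The data, the module and the summary function are hypothetical and serve only to illustrate the two quantities; this is not the published DNB scoring scheme.

```python
import math
import statistics
from itertools import combinations
from typing import List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def module_summary(profiles: List[List[float]]) -> Tuple[float, float]:
    """profiles holds one list of beta values per CpG for a single stage.

    Returns (mean variance, mean absolute pairwise correlation).
    """
    mean_var = statistics.mean(statistics.variance(p) for p in profiles)
    mean_abs_cor = statistics.mean(
        abs(pearson(p, q)) for p, q in combinations(profiles, 2)
    )
    return mean_var, mean_abs_cor


# Hypothetical 3-CpG module measured in 6 samples per stage.
normal_stage = [
    [0.10, 0.12, 0.11, 0.09, 0.10, 0.11],
    [0.20, 0.19, 0.21, 0.20, 0.22, 0.19],
    [0.15, 0.15, 0.14, 0.16, 0.15, 0.16],
]
pre_cancer_stage = [
    [0.10, 0.30, 0.15, 0.45, 0.12, 0.38],
    [0.20, 0.42, 0.24, 0.55, 0.21, 0.50],
    [0.15, 0.33, 0.20, 0.50, 0.16, 0.41],
]

normal_var, normal_cor = module_summary(normal_stage)
pre_var, pre_cor = module_summary(pre_cancer_stage)
```

In the invented pre-cancer stage the CpGs fluctuate more strongly and in a coordinated way, so both summaries rise relative to the normal stage, which is the qualitative signature described above.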


In the context of personalized medicine, 
our focus has been on developing statistical 


methods to help identify cancer-driver 


aberrations in individual tumor samples. Our 


approach, called DART (Denoising Algorithm 


using Relevance network Topology) integrates 
a large training gene expression data set, 


representative of the cancer of interest, 

with a large database of perturbation gene 
expression signatures, many of which have 
been derived in-vitro by knock-down/ 
overexpression of tumour suppressor/ 
onco-genes (Teschendorff et al Genome Biol. 
2015). Unlike other statistical methods, DART 
evaluates the consistency of perturbation 
signatures in the in-vivo training data, and 
further denoises perturbation signatures using 
this large training set. The algorithm learns 


a clique-like module of highly correlated 
genes which make a subset of the original 


perturbation signature, and uses this module 
to infer perturbation activity of independent 
tumour samples. A key advantage of the 
DART algorithm is that it allows in principle 
the dominant perturbations in each individual 
tumour to be identified (assuming that the 
dominant perturbations are those which 
determine the transcriptomic profile of 


the tumour). We demonstrated the added 
value of DART in the context of ER+ breast 
cancer. Specifically, DART was successful in 
identifying an AKT-signaling gene module 
which simultaneously predicts non-response
to endocrine therapy and sensitivity to AKT- 
signaling drug inhibitors. Thus, our study 
suggests that the subgroup of ER+ breast 
cancer patients who do not respond well to 
tamoxifen or aromatase inhibitors may instead 
benefit from treatment targeting the AKT 


signaling pathway. 
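
The general logic of denoising a perturbation signature against training data can be sketched in a few lines. The sketch keeps only those signature genes whose correlations in the training set are strong and agree in sign with the signature, and then scores a new sample by the mean signed z-score over the retained genes. The thresholds, gene names, data and function names are hypothetical, and the pruning and scoring rules are simplifying assumptions rather than the published DART algorithm.

```python
import math
import statistics
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def denoise_signature(signature: Dict[str, int],
                      training: Dict[str, List[float]],
                      min_abs_cor: float = 0.5) -> Dict[str, int]:
    """Keep genes that agree with at least half of the other signature genes."""
    genes = list(signature)
    module = {}
    for g in genes:
        consistent = 0
        for h in genes:
            if h == g:
                continue
            r = pearson(training[g], training[h])
            expected_sign = signature[g] * signature[h]
            if abs(r) >= min_abs_cor and r * expected_sign > 0:
                consistent += 1
        if 2 * consistent >= len(genes) - 1:
            module[g] = signature[g]
    return module


def activity_score(module: Dict[str, int],
                   sample: Dict[str, float],
                   training: Dict[str, List[float]]) -> float:
    """Mean signed z-score of the module genes in one new sample."""
    scores = []
    for gene, sign in module.items():
        mu = statistics.mean(training[gene])
        sd = statistics.stdev(training[gene])
        scores.append(sign * (sample[gene] - mu) / sd)
    return statistics.mean(scores)


# Hypothetical signature: A and B up, C down, D up (but D is noise in vivo).
signature = {"A": 1, "B": 1, "C": -1, "D": 1}
training = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "B": [1.1, 2.3, 2.9, 4.2, 4.8, 6.1],
    "C": [6.0, 5.2, 3.9, 3.1, 2.0, 1.1],
    "D": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
module = denoise_signature(signature, training)
active = activity_score(module, {"A": 6.0, "B": 6.0, "C": 1.0, "D": 3.0}, training)
inactive = activity_score(module, {"A": 1.0, "B": 1.0, "C": 6.0, "D": 3.0}, training)
```

Here gene D is dropped because its training-set correlations do not support the signature, and the retained module assigns a positive score to the sample that matches the perturbation pattern and a negative score to the one that opposes it.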


Future Perspective 


Identification of DNA methylation based cancer 
risk biomarkers: We are part of the Horizon 2020 
EU funded FORECEE (Female Cancer Prediction 
using Cervical Omics to Individualise Screening 
and Prevention) (4C) program, which aims to 
develop risk prediction and early detection 
tests for the 4 main women-specific cancers 
(breast, ovarian, endometrial and cervical 
cancer). The project will measure DNA 
methylation (using Illumina EPIC beadarrays) in 
3 different tissues (Cervical smears, blood and 
buccal) in over 10,000 women. The hypothesis 
and rationale for measuring DNAm in cervical 
cells is that they represent an easily accessible 
source of hormone sensitive epithelial cells, 
thus making them the ideal surrogate for 
inaccessible tissues such as fallopian tube or 
breast. Hence, cervical cells allow, in principle, 
DNAm fingerprints associated with hormonal 
risk factors (which are main risk factors for 
breast, ovarian and endometrial cancer) to 

be identified. Buccal cells will serve as control 
since these are epithelial but hormone 
insensitive cells, whereas profiling blood will 
allow epigenetic defects in immune-cell
types to be identified, with immune-system
defects thought to contribute substantially 
to cancer risk in all 4 main women-specific 
cancers. We will be leading the statistical 
and computational analysis of the data. One 
immediate aim will be the application of 
iEVORA to predict future high grade intra- 
epithelial cervical neoplasia (CIN3+) using 


prospectively collected cervical smear 


samples. In our earlier work performed on 
only 152 samples and using the old Illumina 
27k technology, we obtained an AUC of 
approximately 0.66 (P<0.05). As part of 4C, we 
will be using EPIC beadarrays (providing a ~30- 
fold increase in genome-wide coverage) to 
measure DNAm in over 1000 women, allowing 
us to rigorously assess whether DNAm can 


add predictive power over HPV infection 
status in predicting the future risk of CIN3+. 
An associated statistical challenge will be the 
need to extend iEVORA to account for intra- 
sample cellular heterogeneity, as well as the 
integration with DNAm profiles obtained in 
blood from the same women, in order to then 


incorporate information about the individual's 
immune system status in the risk prediction 
classifier. In addition, we will also be testing 
the alternative DNB risk prediction approach in 
the context of the 4 women-specific cancers 
using the 4C data.


Improved algorithms for DV: We will continue 

to pursue further improvements to the DV 
algorithms. One particular aspect which needs 
exploring is the extension of such algorithms 
to identify differentially variable regions (DVRs). 
Immediate questions to address are whether
DV CpGs (DVCs) occur in the context of DVRs
or whether they represent single isolated
events. Furthermore, it will be important to


assess the functional significance of DVCs 
and DVRs in datasets where matched gene 
expression data is available. 


A Network Physics framework for 
Cancer System-omics 


Current State of Research 


Over the last few years we have been 
advocating a novel theoretical framework for 
the analysis and interpretation of functional 
genomic data, which is aimed at trying to 
help elucidate systems biology principles 
underlying normal developmental biology 
(Banerji CR et al Sci Rep 2013) and how these are 
altered in cancer (West J et al Sci Rep 2012). The 
theoretical framework is based on the concept 


of “signaling entropy”, which we have shown
can successfully classify cellular samples 
according to their differentiation potential 
within distinct cellular lineages (Banerji CR et al 
Sci Rep 2013). As shown by us, signaling entropy 
is also increased in cancer and in putative 
cancer stem-cells (Banerji CR et al Sci Rep 2013).
Thus, in principle, the framework could be 
used to identify the key genes underlying 
cancer stemness. 


However, in order to make more progress 
there were two immediate challenges, 

which in the last 2 years we have successfully 
addressed. First of all, there is an important 
need to understand why signaling entropy 
provides such a highly discriminative marker of 
cancer, and in the case of normal cells, why it 


correlates so well with differentiation potency. 


Signaling entropy integrates the gene 
expression profile of a sample with a highly 
curated signal transduction and interaction
network, quantifying the overall amount of
signaling uncertainty. We have shown that,
mathematically, signaling entropy is largely
determined by the correlation value between 
the transcriptome and the connectome, 


i.e. the connectivity of the proteins in the 
network. Hence, we have been able to show 
that properties such as differentiation potential 


and cancer are encoded by two properties:
(1) a subtle positive correlation between
the level of transcripts and the connectivity
of the corresponding proteins, and (2) the
(approximate) scale-free nature of protein
interaction networks (Teschendorff AE et al Sci
Rep 2015). For instance, replacing the topology 
of the networks with that of a random graph 
does not result in signaling entropy providing 
a marker of differentiation potential or cancer. 
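
A minimal sketch of an entropy-rate calculation on a network may help to make the construction concrete. It assumes, as a modelling choice not spelled out in the text above, that a random walker at protein i moves to a neighbour j with probability proportional to the expression of j. For a symmetric network this walk satisfies detailed balance, so its stationary distribution is proportional to x_i (Ax)_i, and the entropy rate is the stationary average of the local entropies. The toy networks and function name are hypothetical.

```python
import math
from typing import List


def signaling_entropy_rate(adjacency: List[List[int]],
                           expression: List[float]) -> float:
    """Entropy rate of an expression-weighted random walk on a network.

    adjacency: symmetric 0/1 matrix with no isolated nodes.
    expression: strictly positive expression value per node.
    """
    n = len(expression)
    # (A x)_i: total expression of the neighbours of node i.
    neighbour_sum = [
        sum(adjacency[i][j] * expression[j] for j in range(n))
        for i in range(n)
    ]
    # Stationary distribution is proportional to x_i * (A x)_i.
    weights = [expression[i] * neighbour_sum[i] for i in range(n)]
    total = sum(weights)
    rate = 0.0
    for i in range(n):
        local_entropy = 0.0
        for j in range(n):
            if adjacency[i][j]:
                p_ij = expression[j] / neighbour_sum[i]
                local_entropy -= p_ij * math.log(p_ij)
        rate += (weights[i] / total) * local_entropy
    return rate


# Toy networks with uniform expression.
complete4 = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

sr_complete = signaling_entropy_rate(complete4, [1.0, 1.0, 1.0, 1.0])
sr_path = signaling_entropy_rate(path3, [1.0, 1.0, 1.0])
```

With uniform expression the fully connected four-node network gives an entropy rate of log 3, since every step is a uniform choice among three neighbours, whereas the three-node path gives 0.5 log 2 because the two end nodes have only one possible move. The value therefore depends jointly on the network topology and on how expression is distributed over it.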


The second major challenge relates to the 
interpretation of signaling entropy, since all 
the samples analysed previously represent 
averages over large numbers of cells. Thus, 
there are in principle two contributions to 
signaling entropy: a component that reflects
the amount of inter-cellular heterogeneity and
a cell-intrinsic component. Recently, we made


theoretical progress in understanding and 
relating the properties of signaling entropy as 
measured over a population of cells (the “bulk”) 
to those of homogeneous cell populations 

or single cells (Banerji CR et al PLoS Comp 

Biol. 2015). Importantly, we were also able to 
demonstrate that signaling entropy provides 

a general predictor of clinical outcome in 
diverse cancer types (Banerji CR et al PLoS 
Comp Biol. 2015). 


More recently, we have been testing the 


signaling entropy concept on single-cell 
RNA-Seq data (scRNA-Seq) from over 6 major 
studies, encompassing over 7,000 single 

cell profiles, including cell-types from all 
main differentiation states and time-course 
differentiation experiments (Figure-6, 
Teschendorff AE bioRxiv 2016). Our analysis 
shows that signaling entropy provides a 
means of quantifying the differentiation 


potential of single-cells (Figure-7), a result 
which we also show is independent of cell- 
cycle phase, interaction network and NGS 
platform. Importantly, Signaling Entropy 

can discriminate single-cells according to 
differentiation potency without the need 

for feature selection and model selection 
(Figure-7). Signaling Entropy can therefore 
be used to order cells in “pseudotime”, i.e.
according to the actual differentiation potency 
and can be combined with mixture modeling 
methods to infer novel cellular subtypes or to 
reconstruct cell-lineage trajectories, defining 
a complete algorithm which we call SCENT 
(Single Cell ENTropy). We have shown that
SCENT is successful in identifying subgroups 
of progenitor cells that differ in terms of 
differentiation potency. Based on preliminary 
data, signaling entropy may also provide 

a means of identifying putative cancer- 
stem-cell phenotypes. Importantly, having 
demonstrated the validity of the signaling 
entropy concept at the single-cell level, this 
now opens up a framework for understanding 
and relating entropy at the bulk population 
level to the distribution of single-cell entropies. 
Using this framework we have shown that 
inter-cellular transcriptomic heterogeneity is 
regulated to optimize differentiation potency 
at the cell population level, subject of course 
to extracellular signaling constraints. 
Figure 6. A-B) The entropy rate (SR) of a cell represents a measure of the overall level of signaling promiscuity in the cell.
This is expected to vary according to the differentiation potential of the cells. As shown by us, the signaling entropy rate can
be estimated from transcriptomic data and provides a good proxy for the height in Waddington’s epigenetic landscape. 
Figure 7. A) The signaling entropy rate (SR) decreases as the differentiation potential of single-cells within the mesoderm
lineage declines. The number of single cells in each differentiation stage is indicated below the violin plots. Right panels provide
the ROC and AUC discriminating pluripotent from progenitor cells, and progenitor cells from terminally differentiated 
ones. B) As A), but now for a t-test based score of pluripotency (TPSC) computed from a published pluripotency gene 
expression signature. Note that SR performs comparably if not better than TPSC, despite the fact that computing SR is 
purely model-driven and does not involve any feature selection. 
Future Perspective


Signaling entropy and the associated
SCENT algorithm have been successful in
their application to single-cell RNA-Seq data.
Given that our algorithm is purely
model-driven, it is now paramount to test the
algorithm further and to also compare it to 
competing methods. For instance, we are 


planning to compare signaling entropy (as 
a differentiation potency marker) to two 


alternative measures which have recently 
been proposed by other groups. In contrast to 
these other two measures, signaling entropy 
integrates a single cell's transcriptomic profile 
with an interaction network, and thus we 

aim to demonstrate that the integration is 
critical for the improved accuracy of signaling 
entropy as a differentiation potency marker. 
The SCENT algorithm also needs to be tested 
further in the context of reconstructing 


cell-lineage trajectories from time-course 
scRNA-Seq differentiation experiments. We 
plan to perform detailed comparisons to 
existing algorithms such as Monocle, SCUBA, 
DiffusionMap and Wanderlust. The potential of 
signaling entropy to identify putative cancer 
stem-cell phenotypes from cancer scRNA- 

Seq data is another promising avenue we 

are exploring in collaboration with Prof Tariq 
Enver from UCL and Prof Colin Collins from the 
Vancouver Prostate Cancer Centre. 


From a theoretical perspective, having 

a model-driven approach to estimate 
differentiation potency and phenotypic 
plasticity of single-cells allows us to study
in-silico the effect of perturbations (e.g. 
inactivation of tumour suppressor genes or 
activation of oncogenes) on signaling entropy. 


We thus hope to gain a deeper understanding 
of cellular properties such as pluripotency 

and cancer. Progress in this direction may well 
require more sophisticated models which also 
incorporate regulatory network information, 
which we will explore. If this is achieved, signaling
entropy may then also provide a framework


in which to identify key transcription factors 
underlying cellular differentiation, as well 

as to identify the key hubs underlying the 
cancer and cancer-stem-cell phenotypes. 
Since signaling entropy provides an 
approximation to the energy potential in 
Waddington's epigenetic landscape, it could 
also be integrated with methods which 
attempt to construct epigenetic landscapes. 
Understanding from an epigenetic 
perspective what controls signaling entropy 
may also shed further light on the process 
of cellular differentiation and how this gets 
altered in cancer. We shall actively pursue such 
avenues of research over the next few years. 


General Information 


Publications (2014-2017) 


*Corresponding Author SJoint 1st 
Author 


+ Teschendorff AE*. Single-cell entropy 
for accurate estimation of differentiation 
potency from a cell's transcriptome. 

Oct 30, 2016 bioRxiv: https://doi.org/10.1101/084202.


+ Zheng SC, Beck S, Jaffe AE, Koestler DC, 
Hansen KD, Houseman EA, Irizarry RA, 
Teschendorff AE*. Correcting for cell- 
type heterogeneity in epigenome-wide 
association studies: revisiting previous 


analyses. Nat Methods. 2017. In Press.


+ Teschendorff AE*, Breeze CE, Zheng SC 
and Beck S. A comparison of reference- 
based algorithms for correcting cell- 
type heterogeneity in Epigenome-Wide 
Association Studies. BMC Bioinformatics. 
2017 Feb 13;18(1):105. 


+ Gao Y, Teschendorff AE*. Epigenetic 

and genetic deregulation in cancer target 
distinct signaling pathway domains. Nucleic 
Acids Res. 2016 Nov 29. 


+ Yang Z, Wu L, Wang A, Tang W, Zhao Y, 
Zhao H, Teschendorff AE*. dbDEMC
2.0: updated database of differentially
expressed miRNAs in human cancers. 
Nucleic Acids Res. 2016 Nov 28. 


+ Paul DS, Teschendorff AES, Dang MA, 
Lowe R, Hawa MI, Ecker S, Beyan H, 
Cunningham S, Fouts AR, Ramelius A, 
Burden F, Farrow S, Rowlston S, Rehnstrom 
K, Frontini M, Downes K, Busche S, Cheung 
WA, Ge B, Simon MM, Bujold D, Kwan
T, Bourque G, Datta A, Lowy E, Clarke L,
Flicek P, Libertini E, Heath S, Gut M, Gut
IG, Ouwehand WH, Pastinen T, Soranzo N,
Hofer SE, Karges B, Meissner T, Boehm BO, 
Cilio C, Elding Larsson H, Lernmark Å, Steck 
AK, Rakyan VK, Beck S, Leslie RD. Increased 
DNA methylation variability in type 1
diabetes across three immune effector cell
types. Nat Commun. 2016 Nov 29;7:13555. 


+ Breeze CE, Paul DS, van Dongen J, Butcher 
LM, Ambrose JC, Barrett JE, Lowe R, Rakyan 
VK, Iotchkova V, Frontini M, Downes K,
Ouwehand WH, Laperle J, Jacques PE,
Bourque G, Bergmann AK, Siebert R,
Vellenga E, Saeed S, Matarese F, Martens
JH, Stunnenberg HG, Teschendorff AE,
Herrero J, Birney E, Dunham I, Beck S.
eFORGE: A Tool for Identifying Cell Type- 
Specific Signal in Epigenomic Data. Cell Rep. 
2016 Nov 15;17(8):2137-2150. 


+ Yang Z, Wong A, Kuh D, Paul DS, Rakyan 


VK, Leslie RD, Zheng SC, Widschwendter 
M, Beck S, Teschendorff AE*. Correlation 
of an epigenetic mitotic clock with cancer 
risk. Genome Biol. 2016 Oct 3;17(1):205. 


+ Kalwa M, Hanzelmann S, Otto S, Kuo CC, 


Franzen J, Joussen S, Fernandez-Rebollo 

E, Rath B, Koch C, Hofmann A, Lee SH, 
Teschendorff AE, Denecke B, Lin Q, 
Widschwendter M, Weinhold E, Costa IG, 
Wagner W. The lncRNA HOTAIR impacts
on mesenchymal stem cells via triple helix 
formation. Nucleic Acids Res. 2016 Dec 
15;44(22):10631-10643. 


+ Teschendorff AE*, Zheng SC, Feber A, 


Yang Z, Beck S, Widschwendter M. The 
multi-omic landscape of transcription 
factor inactivation in cancer. Genome Med. 
2016 Aug 25;8(1):89. doi: 10.1186/s13073- 
016-0342-8. 


+ Levine ME, Lu AT, Chen BH, Hernandez DG, 


Singleton AB, Ferrucci L, Bandinelli S, Salfati 
E, Manson JE, Quach A, Kusters CD, Kuh D, 
Wong A, Teschendorff AE, Widschwendter 
M, Ritz BR, Absher D, Assimes TL, Horvath
S. Menopause accelerates biological
aging. Proc Natl Acad Sci U S A. 2016 Aug
16;113(33):9327-32. 


+ Doufekas K, Zheng SC°, Ghazali S, Wong
M, Mohamed Y, Jones A, Reisel D, Mould
T, Olaitan A, Macdonald N, Teschendorff
AE, Widschwendter M. DNA Methylation
Signatures in Vaginal Fluid Samples for
Detection of Cervical and Endometrial
Cancer. Int J Gynecol Cancer. 2016 Jun 2.


+ Zheng SC, Widschwendter M,
Teschendorff AE*. Epigenetic drift,
epigenetic clocks and cancer risk.
Epigenomics. 2016 May;8(5):705-


+ Teschendorff AE*, Jones A,
Widschwendter M. Stochastic epigenetic
outliers can define field defects in cancer.
BMC Bioinformatics. 2016 Apr 22;17:178.


+ Chen Y, Breeze CE, Zhen S, Beck S,
Teschendorff AE*. Tissue-independent
and tissue-specific patterns of DNA
methylation alteration in cancer. Epigenetics
Chromatin. 2016 Mar 8;9:10.


+ Teschendorff AE*, Gao Y, Jones A,
Ruebner M, Beckmann MW, Wachter DL,
Fasching PA, Widschwendter M. DNA
methylation outliers in normal breast tissue
identify field defects that are enriched in
cancer. Nat Commun. 2016 Jan 29;7:10478.


+ Gao Y, Jones A, Fasching PA, Ruebner
M, Beckmann MW, Widschwendter
M, Teschendorff AE*. The integrative
epigenomic-transcriptomic landscape of
ER positive breast cancer. Clin Epigenetics.
2015 Dec 9;7:126.


+ Carén H, Stricker SH, Bulstrode H, Gagrica S, Johnstone E,
Bartlett TE, Feber A, Wilson G, Teschendorff AE, Bertone P, Beck S,
Pollard SM. Glioblastoma Stem Cells Respond to Differentiation Cues
but Fail to Undergo Commitment and Terminal Cell-Cycle Arrest. Stem
Cell Reports. 2015 Nov 10;5(5):829-42.


+ Teschendorff AE, Lee SH, Jones A, Fiegl H, Kalwa M, Wagner W,
Chindera K, Evans I, Dubeau L, Orjalo A, Horlings HM, Niederreiter L,
Kaser A, Yang W, Goode EL, Fridley BL, Jenner RG, Berns EM, Wik E,
Salvesen HB, Wisman GB, van der Zee AG, Davidson B, Trope CG,
Lambrechts S, Vergote I, Calvert H, Jacobs IJ, Widschwendter M.
HOTAIR and its surrogate DNA methylation signature indicate
carboplatin resistance in ovarian cancer. Genome Med. 2015 Oct
24;7:108.


+ Ma Z, Teschendorff AE, Leijon A, Qiao Y, Zhang H, Guo J.
Variational Bayesian Matrix Factorization for Bounded Support Data.
IEEE Trans Pattern Anal Mach Intell. 2015 Apr;37(4):876-89.


+ Lin ML, Patel H, Remenyi J, Banerji CR, Lai CF, Periyasamy M,
Lombardo Y, Busonero C, Ottaviani S, Passey A, Quinlan PR, Purdie CA,
Jordan LB, Thompson AM, Finn RS, Rueda OM, Caldas C, Gil J,
Coombes RC, Fuller-Pace FV, Teschendorff AE, Buluwela L, Ali S.
Expression profiling of nuclear receptors in breast cancer
identifies TLX as a mediator of growth and invasion in
triple-negative breast cancer. Oncotarget. 2015 Aug
28;6(25):21685-703.


+ Teschendorff AE*, Yang Z*, Wong A, Pipinikas CP, Jiao Y, Jones A,
Anjum S, Hardy R, Salvesen HB, Thirlwell C, Janes SM, Kuh D,
Widschwendter M. Correlation of Smoking-Associated DNA Methylation
Changes in Buccal Cells With DNA Methylation Changes in Epithelial
Cancer. JAMA Oncol. 2015 Jul;1(4):476-85.


+ Yang Z, Jones A, Widschwendter M, Teschendorff AE*. An integrative
pan-cancer-wide analysis of epigenetic enzymes reveals universal
patterns of epigenomic deregulation in cancer. Genome Biol. 2015
Jul 14;16:140.


+ Teschendorff AE*, Banerji CR, Severini S, Kuehn R, Sollich P.
Increased signaling entropy in cancer requires the scale-free
property of protein interaction networks. Sci Rep. 2015 Apr
28;5:9646.


+ Pipinikas CP, Dibra H, Karpathakis A, Feber A, Novelli M, Oukrif D,
Fusai G, Valente R, Caplin M, Meyer T, Teschendorff A, Bell C,
Morris TJ, Salomoni P, Luong TV, Davidson B, Beck S, Thirlwell C.
Epigenetic dysregulation and poorer prognosis in DAXX-deficient
pancreatic neuroendocrine tumours. Endocr Relat Cancer. 2015
Jun;22(3):L13-8.


+ Teschendorff AE*, Li L, Yang Z. Denoising perturbation signatures
reveal an actionable AKT-signaling gene module underlying a poor
clinical outcome in endocrine-treated ER+ breast cancer. Genome
Biol. 2015 Apr 2;16:61.


+ Banerji CR, Severini S, Caldas C, Teschendorff AE*. Intra-tumour
signaling entropy determines clinical outcome in breast and lung
cancer. PLoS Comput Biol. 2015 Mar 20;11(3):e1004115.


+ Banerji CR, Knopp P, Moyle LA, Severini S, Orrell RW,
Teschendorff AE, Zammit PS. β-Catenin is central to DUX4-driven
network rewiring in facioscapulohumeral muscular dystrophy. J R Soc
Interface. 2015 Jan 6;12(102):20140797.


+ Yuan T°, Jiao Y°, de Jong S, Ophoff RA, Beck S, Teschendorff AE*.
An integrative multi-scale analysis of the dynamic DNA methylation
landscape in aging. PLoS Genet. 2015 Feb 18;11(2):e1004996.


+ Shen B, Teschendorff AE, Zhi D, Xia J. Biomedical data
integration, modeling, and simulation in the era of big data and
translational medicine. Biomed Res Int. 2014;2014:731546.


+ Anjum S, Fourkala EO, Zikan M, Wong A, Gentry-Maharaj A, Jones A,
Hardy R, Cibula D, Kuh D, Jacobs IJ, Teschendorff AE, Menon U,
Widschwendter M. A BRCA1-mutation associated DNA methylation
signature in blood cells predicts sporadic breast cancer incidence
and survival. Genome Med. 2014 Jun 27;6(6):47.


+ Gomez-Cabrero D, Abugessaisa I, Maier D, Teschendorff A,
Merkenschlager M, Gisel A, Ballestar E, Bongcam-Rudloff E, Conesa A,
Tegnér J. Data integration in the era of omics: current and future
challenges. BMC Syst Biol. 2014;8 Suppl 2:11.


+ Teschendorff AE*, Liu X, Caren H, Pollard SM, Beck S,
Widschwendter M, Chen L. The dynamics of DNA methylation covariation
patterns in carcinogenesis. PLoS Comput Biol. 2014 Jul
10;10(7):e1003709.


+ Ma Z, Teschendorff AE, Yu H, Taghia J, Guo J. Comparisons of
non-Gaussian statistical models in DNA methylation analysis. Int J
Mol Sci. 2014 Jun 16;15(6):10835-54.


+ Jiao Y, Widschwendter M, Teschendorff AE*. A systems-level
integrative framework for genome-wide DNA methylation and gene
expression data identifies differential gene expression modules
under epigenetic control. Bioinformatics 2014 Aug 15;30(16):2360-6.


+ Steegenga WT, Boekschoten MV, Lute C, Hooiveld GJ, de Groot PJ,
Morris TJ, Teschendorff AE, Butcher LM, Beck S, Muller M.
Genome-wide age-related changes in DNA methylation and gene
expression in human PBMCs. Age (Dordr). 2014 Jun;36(3):9648.


Cooperation 


+ Risk Prediction of Common Women Cancers, Prof. Martin
Widschwendter, Department of Women’s Cancer, University College
London, London, UK.


+ Bioinformatic analysis of Acute Aortic Dissection, Prof. Xiangdong
Wang & Dr Pan Sun, Department of Cardiac Surgery, Zhongshan Hospital
and Fudan University, Shanghai, China.


+ Statistical Methods for DNA methylation analysis, Dr. Zhanyu Ma,
Pattern Recognition and Intelligent System Laboratory, Beijing
University of Posts and Telecommunications (BUPT), Beijing, China.


+ Epigenetic Phase Transition Models for Cancer Risk Prediction,
Prof. Luonan Chen, Key State Laboratory for Systems Biology,
CAS-SIBS, Shanghai, China.


+ Role of the Nuclear Receptor Superfamily in Breast Cancer, Prof.
Simak Ali, Faculty of Medicine, Imperial College, London, UK.


+ Network Entropy in Systems Biology, Prof. Peter Sollich and Dr.
Reimer Kuehn, Department of Mathematics, King’s College London,
London, UK.


+ Identification of causal DNAm alterations mediating the link
between smoking and air pollution with lung cancer. This is being
carried out in collaboration with Prof Caroline Relton (Bristol
University) and Prof Paolo Vineis (Imperial College London).


+ Deriving DNAm-based biomarkers of biological aging. In
collaboration with Prof Peter Adams (Glasgow University) and Prof
Thomas von Zglinicki (Newcastle University).


+ Methods for the identification of differentially variable
methylated regions. In collaboration with Dr Wang Shuang (Columbia
University, USA).


+ Framework for objective and comprehensive evaluation of cell-type
deconvolution algorithms. In collaboration with Prof Eran Halperin
(UCLA, USA).


External Funding 


+ Newton International Advanced Fellowship 
“Dissection of Intra-Sample Epigenetic 
Heterogeneity with Blind Source Separation 
Algorithms” (Mar.2015-Mar.2018), Royal 
Society (UK) and Chinese Academy of 
Sciences. Amount: GBP 84,000. Main-PI. 


+ General Program Grant “Novel statistical
bioinformatic methods for cancer systems
epigenomics” (Jan.2016-Jan.2020). National
Science Foundation of China. Amount: RMB
600,000. Main-PI.


+ Female Cancer Prediction Using Cervical 
Omics to Individualise Screening and 
Prevention (FORECEE) (Sep.2015-Sep.2019). 
Horizon 2020 EU & Eve Appeal. Amount: 
EUR 8,000,000. Co-PI.


Newsletters/Highlighted Research (2014-2016)


We have published six papers that received high Altmetric scores,
were highly accessed, or were covered by news outlets. Details
follow:


+ Teschendorff AE et al Nat.Commun.2016
(Altmetric 182 + SIBS/UCL Newsletter)


+ Teschendorff AE et al JAMA Oncol.2015
(Altmetric 64 + UCL Newsletter)


+ Yang Z et al Genome Biol. 2016 (Altmetric
38, SIBS Newsletter and Highlighted on
Genome Biol. website)


+ Paul DS et al Nat.Commun.2016 (Altmetric
70 + SIBS/UCL Newsletter)


+ Teschendorff AE et al Genome Med.2016
(>19,000 article accesses)


+ Yang Z et al Genome Biol. 2015 (SIBS
Newsletter)


Patents 


+ Prediction and Treatment of Cancer, M Widschwendter,
A Teschendorff, P Shih Han-Lee, H Fiegl, WO Patent: 2,013,017,867.


+ Method for Predicting Risk of Developing Cancer, M Widschwendter,
A Teschendorff, I Jacobs, WO Patent: 2,012,104,642.


Teaching (2014-2016) 


+ Dec.2014, Dec.2015, Dec.2016: each year
lectured a total of 9 hours on “Statistical Analysis
of Omics Data”, part of the MSc course
“Bioinformatics Algorithms”.


Invited Talks (2014-2016) 


+ “Dissecting epigenetic and transcriptomic heterogeneity:
statistical challenges and solutions”. Invited Keynote at the
“Canceromatics III: Tumor Heterogeneity” conference, Madrid, Spain,
13th-16th Nov.2016.


+ “Cancer system-omics: a statistical mechanics perspective”.
Invited Keynote at the “Health-Exploring Complexity” (HEC2016)
Conference, Muenchen, Germany, 29th Aug.2016.


+ “Multi-omics for women’s cancer: statistical challenges”. Invited
Keynote at the “Annual Conference of the Institute for Women’s
Health”, UCL, London, 14th Jun.2016.


+ “Smoking-associated DNA methylation changes in buccal cells
defines a universal cancer signature”. Invited talk at the
“Epigenetics and Environmental Origins of Cancer (EEOC) Conference”,
IARC, Lyon, France, 11th-12th Jun.2016.


+ “Intra-Sample Cellular Heterogeneity (ISCH) in Epigenome-Wide
Association Studies”. Invited seminar at the Department of
Epidemiology and Biostatistics in the School of Public Health,
Imperial College London, 7th Jun.2016.


+ “The role of DNA methylation alterations in early carcinogenesis”.
Invited talk at the CSHA/AACR Joint Meeting - Big Data, Computation
and Systems Biology in Cancer, Suzhou, 2nd-5th Dec.2015.


+ “Statistical inference from big biomedical data: challenges &
solutions”. Invited talk at the 6th ERTC conference on “Big Data in
the Natural Sciences and Humanities”, Shanghai, 19th-21st Nov.2015.


+ “Novel insights and progress on stem-cell signatures in cancer”.
Invited talk at the BIT Regenerative Medicine and Stem Cell Biology
Congress 2015, Shanghai, 20th Nov.2015.


+ “DNA methylation clocks”. Invited chalk-talk at the Nature
“Epigenetics of Cancer and Aging” international meeting, Beijing,
Oct.15-17th 2015.


+ “Epigenetic Drift, Ageing and Cancer”. Invited talk at the
Epigenetics Discovery Congress, 24th-25th Sept. 2015, London, UK.


+ “Integrative Systems Epigenomics of Cancer”. Invited talk at the
international Asian Biophysics Association (ABA) 2015 Symposium,
Hangzhou, China, 9-10th May 2015.


+ “Integrative Systems Epigenomics in Cancer”. Invited talk at the
“Big Data in Cancer” workshop, University of Warwick, UK, 18th
Mar.2015.


+ “Integrative Methods for Cancer System-omics”. Invited Keynote
Speech at the FP7 STATEGRA meeting “Statistical Methods for Omics
Data Integration and Analysis”, Heraklion, Crete, Greece, 10-12th
Nov.2014.


+ “Signaling Entropy: understanding systems biology through
uncertainty”. Invited talk at the “International Conference for
Biological Physics (ICBP2014)”, Beijing, China, 20th June 2014.


+ “Systems Epigenomics: applications to cancer”. Invited talk at the
Centre for Bioinformatics, Shanghai University, Mar. 2014.


4.8 Functional Human Genetic Variation


Researchers:
Dr. Kun Tang (Principal Investigator)
Phone: +86-21-54920213
Email: tangkun@picb.ac.cn


Students:
Lu Qiao
Dan Li
Shiting Chen


Staff:
Yingqi Zhang (Assistant Professor)
Chen Liu (Research Associate)
Nianhao Chen (Research Associate)


Research Overview


My group has two related research directions. The first is to use
large-scale genomic polymorphism data to make fine inferences about
human demographic history. The second is to use high-resolution 3D
facial images to investigate the genetic bases of human facial
morphology, and how the human face may be used to reflect other
traits such as aging, wellbeing and personality.


For the first research direction, making inferences about human
demographic history and selection events is the main task of
population genetics. This research direction has seen rapid progress
with the surging amount of genomic data. We aim to make full use of
such rich genetic data to infer the details of past selection
events, and to get a clear picture of what happened during the birth
and evolution of our species.


The second research direction is focused on the human face. Human
facial shape is the result of a complex interplay of a large group
of genetic factors. Our primary interest is to understand the
genetic bases and evolutionary mechanisms of human facial shape
variation. The core questions include: which loci are responsible
for the variable facial shapes within and between populations? And
can we reconstruct a human face to a high extent purely from DNA
sequence information? To answer these questions, new methods and
technologies were developed and a novel analysis pipeline is
emerging. Our preliminary data showed that facial shape, as a
collection of complex traits, is highly polygenic, and that in
principle it is possible to predict one’s face to an accuracy
meaningful for forensic applications. Beyond this, the human face,
as a rich source of complex traits, can be used as BIG DATA to query
interesting information such as aging, disease, mood, personality
and attractiveness.


In the last few years, my group has pioneered several new research
areas and gained insights. More intriguing discoveries are expected
in the near future. The individual projects are briefly summarized
as follows:


I. A Chronological Atlas of Natural Selection in the Human Genome
during the Past Half-million Years


The spatiotemporal distribution of recent human adaptation is a long
standing question. We developed a new coalescent-based method that
collectively assigned human genome regions to modes of neutrality or
to positive, negative, or balancing selection. Most importantly, the
selection times were estimated for all positive selection signals,
which ranged over the last half million years, penetrating the
emergence of anatomically modern human (AMH). These selection time
estimates were further supported by analyses of the genome sequences
from three ancient AMHs and the Neanderthals. A series of brain
function-related genes were found to carry signals of ancient
selective sweeps, which may have defined the evolution of cognitive
abilities either before Neanderthal divergence or during the
emergence of AMH. Particularly, signals of brain evolution in AMH
are strongly related to Alzheimer's disease pathways. In conclusion,
this study reports a chronological atlas of natural selection in
humans.


Figure 1. Timeline of PS signals in humans. Each dot represents a
candidate PS signal. Genes that can be assigned to functional
categories of strong relevance to human evolution were labeled in
different colors and shapes (SI Appendix). (A) The PS events are
plotted along a bigger time scale, for all three populations. A
simplified approximate population history was constructed based on
estimated demographic trajectories and known evidence, plotted as a
background graph in light blue. Error bars are standard deviations
of time estimates according to simulations for 0.5, 2, 4, 8 and 16
kga. Ancient signals (≥ 1,900 generations) in YRI were classified
into Nean-like, aEA-like and aYRI-rest by comparing with aEA. PARN,
AUTS2, SORL1 and SNCA show human-specific expression patterns in
brain regions (Supplementary Note 5). The skeleton images of the
four ancient/archaic individuals were adopted from the original
papers (29, 41, 42), and placed at the assumed spatiotemporal
coordinates. H: Human; C: Chimpanzee; R: Rhesus macaque; PFC:
prefrontal cortex; CBC: cerebellum cortex. (B) Signals in CEU were
illustrated in finer time scale for 4 groups: aFM-like, aHG-like,
aFM-aHG common and CEU-rest.


This work is now pre-published on bioRxiv.org and has attracted
intensive attention from both peer researchers and the public. It
now ranks 45th out of all 8,800 papers in bioRxiv, and many media
outlets reported this study, including Nature News.


Figure 2. Above, the Altmetric ranking of this study; below, the
news coverage of this study.


II. Detecting Genome-wide Variants of Eurasian Facial Shape
Differentiation: DNA based Face Prediction Tested in Forensic
Scenario


It is a long standing question as to which genes define the
characteristic facial features [...] population to query the genetic
bases of why Europeans and Han Chinese look different. Facial traits
were analyzed based on high-dense 3D facial images; numerous
biometric spaces were examined for divergent facial features between
European and Han Chinese, ranging from inter-landmark distances to
dense shape geometrics. Genome-wide association analyses were
conducted on a discovery panel of Uyghurs. Six significant loci were
identified, four of which (rs1868752, rs118078182 and rs60159418 at
or near UBASH3B, COL23A1 and PCDH7, and rs17868256) were replicated
in independent cohorts of Uyghurs or Southern Han Chinese. A
quantitative model was developed to predict 3D faces based on 277
top GWAS SNPs. In hypothetical forensic scenarios, this model was
found to significantly enhance the verification rate, suggesting a
practical potential of related research.


Figure 3. Above: 153920540 associated with nose in females. The
degree of partial facial displacement that the different genotypes
of the SNP affect is depicted first. Then we show the extreme faces
based on residual faces, with their allele labeled in front. The top
extrapolated faces and alleles correspond to Han Chinese-like faces.
The bottom extrapolations are European-like faces. Each
transformation is shown in front and lateral view. The last three
heat plots are comparisons in which the top allele average face maps
to the bottom average face along the X, Y and Z axes. Bluer color
denotes a smaller or sunken shape, and redder color a larger or
protruding shape.

Below: Visualization of actual and predicted face. The first face is
the actual face, followed by the predicted face and a heat plot of
the actual face mapped to the predicted face.


[Figure 4 panel label: Extraversion, r = 0.209]


Figure 4. Features selected by the CPLSC model from faces
significantly associated with BF factors in the two genders. Three
panels are sequentially ordered from top to bottom:
agreeableness-related CPLSC, which consists of the first three PLSCs
in males; conscientiousness-related CPLSC, which consists of the
first three PLSCs in males; extraversion-related CPLSC, which
consists of the first two PLSCs in females. In each panel, the upper
faces are simulated by adding five standard deviations of the
projected samples to the mean face; the lower faces are created by
subtracting the standard deviation of the projected samples from the
mean face. From left to right are faces rotated by 90°, 45°, and 0°.
The bigger face following next is the mean face, on which the heat
colors represent the norm value of CPLSC at each vertex. At the
right side of the mean face are two faces the same as the faces at
the left side of the mean face, but with a texture generated from a
sample mean face.


III. Signatures of personality on dense 3D facial images


It has long been speculated that cues on the human face exist that
allow observers to make reliable judgments of others’ personality
traits. However, direct evidence of association between facial
shapes and personality is missing from the current literature. This
study assessed the personality attributes of 834 Han Chinese
volunteers (405 males and 429 females), utilising the five-factor
personality model (‘Big Five’), and collected their neutral 3D
facial images. Dense anatomical correspondence was established
across the 3D facial images in order to allow high-dimensional
quantitative analyses of the facial phenotypes. In this paper, we
developed a Partial Least Squares (PLS)-based method. We used
a composite partial least squares component (CPLSC) to test the
association between the self-tested personality scores and the dense
3D facial image data, then used principal component analysis (PCA)
for further validation. Among the five personality factors,
agreeableness and conscientiousness in males and extraversion in
females were significantly associated with specific facial patterns.
The personality-related facial patterns were extracted and their
effects were extrapolated on simulated 3D facial models.
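
As a rough illustration of the kind of association test described above, the following minimal Python sketch computes only the first PLS component for a single personality score and checks it with a label-shuffling permutation test. It is a simplified stand-in under stated assumptions, not the CPLSC procedure used in the study: the function names, the toy "face" coordinates and the trait values are all hypothetical, and a real analysis would use many thousands of vertices and several components.

```python
import random
from math import sqrt


def center(values):
    """Subtract the mean so that covariances reduce to dot products."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def first_pls_component(X, y):
    """First PLS weight vector and sample scores for a single response y.

    X is a list of samples (rows), each a list of shape coordinates.
    """
    columns = [center(list(col)) for col in zip(*X)]
    y_centered = center(y)
    # For one response, the first weight vector is proportional to X^T y.
    weights = [sum(c * t for c, t in zip(col, y_centered)) for col in columns]
    norm = sqrt(sum(w * w for w in weights))
    if norm > 0:
        weights = [w / norm for w in weights]
    scores = [
        sum(col[i] * w for col, w in zip(columns, weights))
        for i in range(len(y))
    ]
    return weights, scores


def score_covariance(X, y):
    """Covariance between the first-component scores and the response."""
    _, scores = first_pls_component(X, y)
    y_centered = center(y)
    return sum(s * t for s, t in zip(scores, y_centered)) / (len(y) - 1)


def permutation_p_value(X, y, n_perm=200, seed=0):
    """Fraction of shuffled responses that match or beat the observed covariance."""
    rng = random.Random(seed)
    observed = score_covariance(X, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score_covariance(X, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: six "faces" with three coordinates each, and a
# made-up trait score that mostly follows the first coordinate.
faces = [
    [0.0, 1.0, 2.0],
    [1.0, 0.5, 2.1],
    [2.0, 1.2, 1.9],
    [3.0, 0.8, 2.0],
    [4.0, 1.1, 2.2],
    [5.0, 0.9, 1.8],
]
trait = [1.0, 2.1, 2.9, 4.2, 5.0, 6.1]

weights, scores = first_pls_component(faces, trait)
print("weights:", [round(w, 3) for w in weights])
print("permutation p-value:", permutation_p_value(faces, trait))
```

The weight vector points along the coordinates that co-vary most with the trait, which is the same idea the figure captions use when they add or subtract multiples of the projected standard deviation to the mean face.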
Future Perspective


Direction 1: Fine-scale demographic inference of human history


The rapidly growing sequence data of the human genome may provide
unprecedented detail about recent human demographic history and
selection events. We have demonstrated in Section I that coalescent
trees can be reliably reconstructed from sequencing data. The
coalescent trees essentially hold the maximum information one can
obtain from genetic data about the past. The well-defined
statistical models of the coalescent suggest that tests can be
designed with relative ease to examine all kinds of hypotheses, such
as admixture, ancient admixture, migration and disease evolution. We
will explore these questions based on our new framework of
coalescent reconstruction.


Direction 2: The genetic model and BIG DATA analyses of human face


Our GWAS of common human facial variation was among the first to
identify genetic determinants of facial shape. Furthermore, a
prediction model was proposed and shown to predict one’s face to a
significant extent based purely on DNA. This marked a milestone in
forensic technology. In the long run, we are going to collaborate
closely with national forensic institutes, such as the Center for
Material Evidence Authentication, Ministry of Public Security, to
expand the training samples, improve the prediction model and
eventually apply the face prediction methods in real forensic
scenes.


Furthermore, we noticed that the 3D face carries rich cues about
health condition. In preliminary research, we found that certain
facial signatures are strongly associated with multiple health
conditions, such as blood pressure, blood sugar, blood lipids and
complex diseases. In the long run, we are pursuing an integrated
solution to measure different aspects of people's health status
based on 3D face imaging.


Figure 4. A signature of hypertension can be clearly viewed in the
3D face analysis. Panel labels: Hypertension Disease; Average Face;
No Hypertension Disease.


General Information


Publications, 2015-2017 (*co-first authors, †co-corresponding
authors)


+ Hu S, Xiong J, Fu P, Qiao L, Tan J, Jin L, Tang K# (#
corresponding author). Signatures of personality on dense 3D facial
images. Sci Rep. 2017 Dec;7(1):73. doi:10.1038/s41598-017-00071-5.
PubMed PMID: 28250433.


+ Zhang M, Wu S, Zhang J, Yang Y, Tan J, Guan H, Liu Y, Tang K,
Krutmann J, Xu S, Jin L, Guan Y, Li H, Wang S. Large-scale
genome-wide scans do not support petaloid toenail as a Mendelian
trait. J Genet Genomics. 2016 Dec 20;43(12):702-704. doi:
10.1016/j.jgg.2016.10.003. PubMed PMID: 27964859.


+ Wu S, Tan J, Yang Y, Peng Q, Zhang M, Li J, Lu D, Liu Y, Lou H,
Feng Q, Lu Y, Guan Y, Zhang Z, Jiao Y, Sabeti P, Krutmann J, Tang K,
Jin L, Xu S, Wang S. Genome-wide scans reveal variants at EDAR
predominantly affecting hair straightness in Han Chinese and Uyghur
populations. Hum Genet. 2016 Nov;135(11):1279-1286. PubMed PMID:
27487801.


+ Peng Q, Li J, Tan J, Yang Y, Zhang M, Wu S, Liu Y, Zhang J, Qin P,
Guan Y, Jiao Y, Zhang Z, Sabeti PC, Tang K, Xu S, Jin L, Wang S.
EDARV370A associated facial characteristics in Uyghur population
revealing further pleiotropic effects. Hum Genet. 2016
Jan;135(1):99-108. doi: 10.1007/s00439-015-1618-6.


+ Zhou, Hang, Sile Hu, Rostislav Matveev, Qianhui Yu, Jing Li,
Philipp Khaitovich, Li Jin, Mark Stoneking, Qiaomei Fu, Kun Tang# (#
corresponding author). A Chronological Atlas of Natural Selection in
the Human Genome during the Past Half-million Years. bioRxiv, 2015.
Highlighted by Nature News.


+ Chen, W., Qian, W., Wu, G., Chen, W., Xian, B., Chen, X., Tang, K.
& Han, J. D. J. Three-dimensional human facial morphologies as
robust aging markers. Cell Research (2015). Highlighted by Science.


+ Valverde, Guido, Hang Zhou, Sebastian Lippold, Cesare de Filippo,
Kun Tang, David López Herráez, Jing Li, and Mark Stoneking. “A Novel
Candidate Region for Genetic Adaptation to High Altitude in Andean
Populations.” PloS one 10, no. 5 (2015).


+ Qian, Wei, Hang Zhou, and Kun Tang# (# corresponding author).
“Recent Coselection in Human Populations Revealed by Protein-Protein
Interaction Network.” Genome biology and evolution 7.1 (2015):
136-153.


4.9 Systems Neuroscience


Researchers:
Dr. Guang-Zhong Wang (Principal Investigator)
Phone: +86-21-54920578
Email: guangzhong.wang@picb.ac.cn


Students:
Yang Cheng
Jie Li
Guopeng Liu
Dan Jiang


Staff:
Ganlu Hu (Postdoctoral Researcher)
Wei Wang (Postdoctoral Researcher)


Research Overview


Big brain projects such as Obama's BRAIN Initiative in the US and
Europe’s Human Brain Project are moving neuroscience research into
the “big science” era. Following that trend, different countries
have planned or announced more initiatives. Lessons learned from
genomics tell us that high-throughput data generation [...] and
analysis will become routine tasks in the “big science” era [...]
for neuroscience. The integrative analysis of brain-scale data faces
a lot of challenges, from sequencing to imaging data. For example,
it is still unclear how the underlying massive molecular activities
can be integrated to reflect different brain functions and how [...]
can lead to the development of the human brain. My research
laboratory is interested in solving those questions by exploring
systems biology approaches. We are conducting integrative analysis
of brain transcriptome data, molecular evolutionary data, single
cell sequencing data, neuroanatomy and brain imaging data. The
current research direction of the lab focuses on the preservation
and evolution of co-expression networks and the evolutionary
constraints of the single cell transcriptome in the brain.


I. Brain co-expression network


Current State of Research


The brain is the most complex organ that exists on earth, containing
hundreds of regions with distinct structure and function. At the
molecular level there are more than 20,000 genes [...], most of
which are detected with clear expression profiles in brain tissues.
Brain tissues are also reported to have more complex alternative
splicing patterns as well as more RNA editing sites compared with
other tissues. It is assumed that this molecular complexity leads to
their functional complexity at a higher level. However, how to fully
explain the advanced functional complexity of different brain
regions from the various molecular activities transcribed and
translated remains unsolved.


Decoding the brain transcriptome is not trivial, as distinct
molecules are expressed in each brain region, and the temporal
expression pattern of each region varies a lot during development.
Hundreds of differentially expressed genes (DEG) can be identified
temporally and spatially within sequencing samples, which may play
different roles. The regional DEG implicate the functional
differences among certain brain areas and the developmental DEG
indicate how the developmental processes are regulated in a
particular region. The overall distribution patterns are of interest
to this field. By investigating transcriptional divergence among
neocortical regions, an hourglass model has been proposed during
development. Samples from infancy and childhood show the least
number of DEG while fetal and postnatal brains have the largest
number of DEG. Although the trend is not observed for non-neocortical
regions, it indicates some fundamental principles for the
development of the central nervous system.


Figure 1. fMRI associated genes are densely connected in the
cortical co-expression network (adapted from Wang, et al., Neuron,
2015).


Figure 2. Co-expression network of neocortical regions and the
enrichment of memory encoding associated genes in different modules
(unpublished data).


The network approach is commonly used to reveal the complexity of
transcriptomic data. Co-expression networks are reconstructed by
linking all genes with similar expression profiles together,
regardless of whether they are from different regions or different
time series. Gene pairs that physically interact or are located in
the same biological pathway or protein complex will exhibit strong
co-expression patterns while random gene pairs will not. Thus genes
with similar functions will be clustered together in the network.
Using this principle, new genes and pathways that are important to
brain function or critical to the pathogenesis of disease can be
discovered. By correlating brain resting-state imaging data and
brain-wide transcriptome data, we have identified 38 genes whose
expression levels show significant association with brain functional
activity (Wang, et al., Neuron, 2015). Those 38 genes are likely to
be down-regulated in autism brain samples and they have higher
co-expression levels than expected by chance (Figure 1),
highlighting the functional similarity of those genes. In another
study we searched for the correlation between molecular signals and
oscillation changes during memory encoding in the human brain (Berto
and Wang, et al., Cerebral Cortex, in revision) and found more than
150 significant candidates. The genes we found are preferentially
expressed in neurons and are enriched in autism genes. By utilizing
co-expression network analysis we identified additional genes that
might connect to memory encoding oscillations (Figure 2).
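
The linking step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: it uses a hard Pearson-correlation threshold (real co-expression pipelines often use soft thresholding and module detection instead), and the gene names, expression values and the 0.8 cutoff are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_edges(profiles, threshold=0.8):
    """Link every gene pair whose absolute correlation reaches the threshold."""
    edges = {}
    for gene_1, gene_2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[gene_1], profiles[gene_2])
        if abs(r) >= threshold:
            edges[(gene_1, gene_2)] = r
    return edges


# Toy profiles: hypothetical genes with made-up values across five samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],  # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],  # unrelated to geneA and geneB
}

edges = coexpression_edges(profiles)
for pair, r in edges.items():
    print(pair, round(r, 3))
```

Samples from different regions or time points simply become additional positions in each profile, which is why the same construction works across regions and across developmental series.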


The co-expression relationship may be conserved between networks,
which implies different things depending on the design of the
experiment. If samples are from replicated studies or studies with
similar experimental design, the conservation of co-expression
measures the robustness of the computational analysis. If samples
are from different species, preserved modules may indicate
evolutionarily conserved features while species-specific modules
might be linked to special adaptation, both of which are of
interest. For instance, by comparing sequencing data from brain
regions in human, chimpanzee and macaque, a human-specific module
with the famous circadian regulator gene CLOCK as a hub gene was
identified, together with a module that contains the
language-related gene FOXP2.


Future Perspective 


Multiple large-scale co-expression networks
have been constructed during the last
several years, from mouse to human and
from the prenatal to the aging brain. Many novel
genes that are important either to a specific
developmental stage or to a disease
state have been characterized. However,
the conservation of those modules
among different networks has not been fully
explored, which gives us an opportunity
to investigate their dynamic changes
during development and through
evolution. Additionally, although some
modules are known to be well preserved across
networks, the mechanism of this preservation
is unknown. As functional units of biological
networks, the members of a protein complex are
highly co-expressed and co-regulated. The
conservation of co-expression modules might
therefore be partly due to the conservation of
protein complexes. To test this hypothesis, we
collected the major co-expression networks for
different brain developmental stages, for both
normal brains and brains with autism. We
then calculated the conservation of protein
complexes in co-expression modules. Indeed,
we found that 90% of protein complexes
are enriched in just one co-expression
module (Figure 3). We plan to
examine whether this preservation leads to
the conservation of co-expression modules
and whether disease-gene-related modules
are clustered. This project aims to explain
how brain-related co-expression
networks change during development and
contribute to disease.
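
One common way to ask whether a protein complex is enriched in a co-expression module is a one-sided hypergeometric test. The sketch below assumes that test, a hypothetical significance cut-off of 0.05 and no multiple-testing correction; the source does not state which statistic was used, and the function names and toy gene sets are illustrative only.

```python
from math import comb

def hypergeom_enrichment_p(n_genes, module_size, complex_size, overlap):
    """P(X >= overlap) when `complex_size` genes are drawn at random from
    `n_genes`, of which `module_size` belong to the module."""
    total = comb(n_genes, complex_size)
    upper = min(module_size, complex_size)
    tail = sum(
        comb(module_size, k) * comb(n_genes - module_size, complex_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

def enriched_modules(complex_genes, modules, n_genes, alpha=0.05):
    """Return the names of modules in which the complex is enriched."""
    hits = []
    for name, members in modules.items():
        overlap = len(complex_genes & members)
        p = hypergeom_enrichment_p(n_genes, len(members), len(complex_genes), overlap)
        if overlap > 0 and p < alpha:
            hits.append(name)
    return hits

# Toy example: 200 genes, two modules, one six-member complex.
modules = {"M1": set(range(0, 50)), "M2": set(range(50, 120))}
complex_genes = {1, 2, 3, 4, 5, 60}
hits = enriched_modules(complex_genes, modules, n_genes=200)
```

Counting the complexes for which `hits` contains exactly one module would give the kind of summary reported in Figure 3.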
Figure 3. Conservation of protein complexes in brain co-expression
modules. Ninety percent (552) of protein complexes are
enriched in just one co-expression module (unpublished
data).


II. Evolutionary constraint of the single-cell
transcriptome in the brain


Current State of Research 


Unlike the liver, kidney or heart, whose functions
are relatively homogeneous, different brain
regions perform different advanced functions.
The great heterogeneity of the brain mainly
comes from its complex cell types. To
date, it is still unknown how many cell
types exist in a given brain region, which
limits the application of high-throughput
sequencing and analysis to
brain tissues. Because single-cell
sequencing technology has the potential
to classify all the cells in a given volume of
tissue, more and more neuroscientists are using
single-cell transcriptomes to characterize
the expression profiles of their samples of
interest. As a result, tens of new cell types
have been identified so far, many of which
may have important functional roles in the
nervous system. Additionally, with the activity
of thousands of genes being profiled in
individual cells, novel cell type biomarkers have
been identified as well. Many of these new markers
are more robust than previously reported
well-known markers.


Many challenges exist in analyzing single-cell
transcriptome data. On the one hand, current
sequencing pipelines can capture only a few
thousand genes per cell, leaving
many lowly expressed transcripts undetected.
Even among the transcripts that are captured,
many have very low expression
values. These factors restrict the application of
the sophisticated, comprehensive analysis
methods that are used for bulk sequenced
samples. On the other
hand, computational methods and pipelines
need to be developed to understand the
heterogeneity and function of those cells.
The first challenge might be solved soon,
as cheaper and more powerful sequencing
technology is being developed in the field.
The second will become more urgent
as it becomes tied to specific scientific questions.


Neurons are connected to form neural
networks, whose functional unit is the
neuronal circuit. Neurons have
to be well organized in circuits in order
to perform certain functions. To
process information effectively, the design
of neural networks follows various principles,
which place different constraints on
neuronal circuits. For instance, neuronal
networks follow an economic
trade-off between reducing cost
and increasing adaptive topological value,
and many aspects of the network work to
minimize its free energy. Such constraints
have not been reported for non-neuronal cells.

One interesting question is whether these
functional constraints at the network level are
reflected at the molecular level. This is likely,
as it has recently been widely reported that
advanced brain activity is associated with
the expression levels of hundreds of genes.
Here we tested whether neuronal cells are under
stronger evolutionary constraints than
non-neuronal cells in the brain by exploring
single-cell sequencing data. It was reported previously
that the transcriptome of cortical regions is under
more evolutionary constraint than that of
subcortical regions. Evolutionary
constraint was measured by the correlation
between expression level and the evolutionary
rate of coding sequences (dN/dS), which is
usually negative and is thus called the E-R
anti-correlation. Among regions with similar functions,
we did not find any significant differences
in E-R anti-correlation (Figure
4), indicating that overall they are under similar
evolutionary constraints. We then calculated
this measure at the single-cell level in more
than 3,000 cells. Significant inverse correlations
between expression and the dN/dS ratio are
widely observed for individual cells (Figure 5).
Interestingly, excitatory and inhibitory neurons
show stronger correlations than non-neuronal
cells, implying stronger evolutionary
constraints on the transcriptomes of those cells.
encoded in the genome. 


[Figure 4 plot. Axis label: correlation between expression level and evolutionary rate (E-R).]


Figure 4. No significant differences of E-R anti-correlation 
levels were detected in ten neocortical brain regions 
(unpublished data). 
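
The E-R anti-correlation for a single cell can be sketched as a rank correlation between per-gene expression and per-gene dN/dS. The snippet below assumes Spearman's correlation and restricts the calculation to genes detected in the cell; both are illustrative assumptions, since the text does not specify the correlation statistic or any gene filtering, and the function names and toy values are hypothetical.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    sorted_x = x[order]
    ranks = np.empty(len(x), dtype=float)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sorted_x[j + 1] == sorted_x[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks

def er_correlation(expression, dn_ds):
    """Spearman correlation between expression and dN/dS for one cell.

    Genes with zero counts are dropped first (an assumption, motivated by
    the low capture rate of single-cell protocols)."""
    expression = np.asarray(expression, dtype=float)
    dn_ds = np.asarray(dn_ds, dtype=float)
    detected = expression > 0
    r_expr = average_ranks(expression[detected])
    r_rate = average_ranks(dn_ds[detected])
    return float(np.corrcoef(r_expr, r_rate)[0, 1])

# Toy cell: highly expressed genes evolve slowly, one gene is undetected.
dn_ds = np.array([0.05, 0.10, 0.20, 0.40, 0.80])
expression = np.array([50.0, 20.0, 0.0, 5.0, 1.0])
rho = er_correlation(expression, dn_ds)
```

A more negative `rho` corresponds to a stronger E-R anti-correlation, which the text interprets as stronger evolutionary constraint.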


Figure 5. Significant E-R anti-correlations are detected at the single-cell level in brain tissue, with excitatory and
inhibitory neurons showing stronger correlations than non-neuronal cells (unpublished data).

Figure 6. A unified model of evolutionary constraint for different cells in brain tissues (unpublished data).


Future Perspective


By profiling a large number of molecular
activities in one cell, single-cell technology
provides an unprecedented opportunity to
survey all of the cell types in the brain.
Many important cell types remain to be
discovered, and novel methodologies and
concepts need to be developed for the
analysis of those data. By investigating the
evolutionary constraints on each cell, our group
is testing the hypothesis that cells
involved in neuronal circuits are under stronger
evolutionary constraints than other cells. We
will then compare the evolutionary constraints
on disease-related cells with those on normal
cells. Disease-related cells might be under
different constraints because their transcriptomes
have been mis-regulated. Finally, the average
E-R anti-correlation of cells with randomized
gene expression will be calculated and
compared with that of normal cells, and a
unified model of evolutionary constraints at the
single-cell level will be explored in the brain
(Figure 6).
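
The planned comparison with randomized gene expression can be sketched as a permutation baseline. The code below assumes that randomization means shuffling expression values across genes within a cell, and that the rank correlation is computed without tie correction; the function names, the number of shuffles and the toy data are all hypothetical.

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation via double argsort (assumes no tied values)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

def randomized_er_baseline(expression, dn_ds, n_shuffles=200, seed=0):
    """Compare the observed E-R correlation with a shuffled-expression null.

    Returns the observed correlation, the mean correlation over shuffles,
    and an empirical one-sided p-value for the observed value being as
    negative as or more negative than the null."""
    rng = np.random.default_rng(seed)
    observed = spearman(expression, dn_ds)
    null = np.array(
        [spearman(rng.permutation(expression), dn_ds) for _ in range(n_shuffles)]
    )
    p_value = (np.sum(null <= observed) + 1) / (n_shuffles + 1)
    return observed, float(null.mean()), float(p_value)

# Toy cell with a perfectly monotone decreasing relationship.
dn_ds = np.linspace(0.05, 1.0, 50)
expression = 100.0 / (1.0 + 10.0 * dn_ds)
observed, null_mean, p_value = randomized_er_baseline(expression, dn_ds)
```

The shuffled baseline is expected to centre near zero, so the gap between `observed` and `null_mean` gives a scale on which cells of different types can be compared.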
4.10 Epigenome Biology


Researchers:

Dr. Gang Wei (Principal Investigator)
Phone: +86-21-54920557
Email: weigang@picb.ac.cn


Students:

Zhijun Han
Li Ma
Xuan Cao
Xiaoran Zhang


Staff:

Ying Li (Research Associate)
Xuelong Wang (Research Associate)


Research 
Overview 


The genetic information in eukaryotic cells
is present in the context of chromatin. The
epigenome, which mainly consists of DNA
methylation, histone modifications and
higher-order chromatin structure, differs from cell
type to cell type and contributes substantially to the
distinct patterns of gene expression during
cellular differentiation. Our research interests
focus on how chromatin structure, especially
epigenetic modification patterns and
higher-order chromatin structure, is established
during cellular differentiation and how
it contributes to normal development and
disease states. By combining genome-wide
experimental approaches with computational
data analysis, we aim to address these
questions systematically and gain
insights into the relationship between
chromatin structure and gene transcription.


Epigenetic modifications 


Current State of Research 


Epigenetic mechanisms are implicated in
regulating gene transcription by controlling
chromatin structure and DNA accessibility.
Recent advances in next-generation
sequencing technology allow us to monitor
the chromatin state of specific cell types
genome-wide and greatly enhance our
understanding of transcriptional regulation by
epigenetic mechanisms.


1) Epigenetic modifications in cancer 


Epigenetic regulation has emerged as a critical step in
tumorigenesis and metastasis. Many histone methyltransferases
and demethylases have been implicated as tumor suppressors or
oncogenes. However, the key epigenomic events in cancer cell
transformation remain poorly understood. To study the
underlying mechanisms of breast cancer transformation, we
utilized an established cell-based transformation model. By
profiling several histone modifications in four cell lines
that represent different stages of tumor cell transformation,
we observed a gradual reduction of H3K9me2 and H3K9me3 along
with transformation (Zhao, et al, 2016, Clinical Epigenetics).
Immunostaining of cancer tissues also showed a strong
reduction of these two histone modifications. H3K9me2 is
generally believed to be a transcriptional repression mark and
often modifies broad regions of chromatin, which have
previously been named large organized chromatin K9
modifications (LOCKs). Consistent with the change of H3K9me2
at the global level, the total number and genome coverage of
H3K9me2 LOCKs also decreased during transformation.
Surprisingly, we found that the genes located at the
boundaries of the decreased LOCKs are enriched in
cancer-related pathways (Fig.1). This suggests that the
localization of genes in LOCKs is related to cellular
functional changes and that unlocking the oncogenes at the
boundaries facilitates the transformation process. To explore
the underlying molecular mechanisms, we examined the
expression levels of all H3K9 methyltransferases and
demethylases by RNA-seq and found that the expression of
KDM3A, a demethylase for H3K9me2, gradually increased during
transformation. Further investigating the role of KDM3A in
regulating transformation, we found that KDM3A directly binds
to the oncogenes associated with LOCKs and regulates their
transcription by removing the H3K9me2 mark.

Fig.1 Epigenomic dynamics during malignant transformation in breast cancer model.


2) Epigenetic modifications in development


Polycomb group (PcG) proteins, initially shown to maintain the
heritable repression of transcription, play an important role
in controlling the expression of genes essential for
development, differentiation and maintenance of cell fate.
Although several isolated studies suggest a positive
regulatory role of PcG proteins in transcription, the
underlying mechanism is largely unknown. To explore this
positive regulatory role, we first searched for highly
conserved genes involved in developmental control that are
positively regulated by Pc and found several genes, including
Sens, Rosy and Spn100A (Lv, et al, 2016, Cell Research).
Further investigation showed that Pc functions together with
PR-Set7 to regulate Sens transcription by modulating H4K20me1.
To explore whether the positive regulation of Sens
transcription by Pc represents a general mechanism, we
searched genome-wide for co-target genes bound by Pc,
H3K27me3 and H4K20me1. In total, 613 genes were identified as
Pc+H3K27me3+H4K20me1+, while 438 genes were
Pc+H3K27me3+H4K20me1-. The average expression level of
Pc+H3K27me3+H4K20me1+ genes was significantly higher than that
of Pc+H3K27me3+H4K20me1- genes. It thus appears that H4K20me1
acts as a selective mark that determines the transcription
state of Pc+H3K27me3+ genes. Further, we generated binding
heat maps of Pc, H3K27me3, H4K20me1 and RNA polymerase II at
Pc+H3K27me3+H4K20me1+ and Pc+H3K27me3+H4K20me1- genes. The
result showed that Pc signals tend to peak around
transcription start sites (TSSs) in the Pc+H3K27me3+H4K20me1+
genes, whereas at the gene bodies the Pc signals were
significantly reduced. In contrast, this reduction was not
significant in Pc+H3K27me3+H4K20me1- genes. In addition, the
ratio of the Pc signal around the TSS to that at the gene body
(Pc promoter / Pc gene body) correlated positively with gene
expression level. To further explore the mechanism of
Pc-mediated positive regulation of gene transcription, we
performed motif analysis. We found a significant enrichment of
the binding motif for the transcription factor Br in the
Pc+H3K27me3+H4K20me1+ genes but not in the
Pc+H3K27me3+H4K20me1- genes (Fig.2). Thus Pc functions
together with specific transcription factors such as Br and
epigenetic enzymes such as PR-Set7 to positively regulate
gene transcription.

Fig.2 Positive role for polycomb protein in transcriptional regulation.
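
The gene classification and the promoter-to-gene-body ratio described above can be sketched with simple set operations. Everything in the snippet is illustrative: the gene identifiers and signal values are made up, the pseudocount is an assumption added to avoid division by zero, and the function names are hypothetical rather than part of the published analysis.

```python
from statistics import mean

def classify_pc_targets(pc_bound, h3k27me3_marked, h4k20me1_marked):
    """Split Pc and H3K27me3 co-target genes by presence of H4K20me1."""
    co_targets = pc_bound & h3k27me3_marked
    return {
        "Pc+H3K27me3+H4K20me1+": co_targets & h4k20me1_marked,
        "Pc+H3K27me3+H4K20me1-": co_targets - h4k20me1_marked,
    }

def promoter_body_ratio(promoter_signal, body_signal, pseudocount=1.0):
    """Ratio of Pc signal around the TSS to Pc signal over the gene body."""
    return (promoter_signal + pseudocount) / (body_signal + pseudocount)

# Toy gene sets (hypothetical identifiers).
pc_bound = {"g1", "g2", "g3", "g4"}
h3k27me3_marked = {"g1", "g2", "g3", "g5"}
h4k20me1_marked = {"g1", "g2", "g6"}
expression = {"g1": 12.0, "g2": 8.0, "g3": 1.0}

groups = classify_pc_targets(pc_bound, h3k27me3_marked, h4k20me1_marked)
mean_expression = {
    label: mean(expression[g] for g in genes) for label, genes in groups.items()
}
ratio = promoter_body_ratio(promoter_signal=9.0, body_signal=1.0)
```

Comparing `mean_expression` between the two groups, and correlating the ratio with expression across genes, mirrors the two comparisons reported in the paragraph.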


3) Epigenetic modifications in cell fate 
conversion 


Somatic cells can be reprogrammed to pluripotent stem cells or
transdifferentiated into another cell type. For example,
transduction of Gata4, Hnf1a and Foxa3 can directly induce
terminally differentiated fibroblasts to become functional
hepatocyte-like (iHep) cells. Much effort has been made to
unravel the epigenetic mechanisms underlying cell fate
conversion. However, few studies have addressed the dynamic
changes during the early lineage conversion process. Using
the iHep system, we dissected the dynamics of the
transcriptome, histone modifications, chromatin accessibility
and binding of key transcription factors during direct hepatic
lineage conversion (Fig.3) (Zhu, et al, 2017, Cell Research).
By comparing transcriptome data across this process, we found
that gene expression starts to change early, from the
intermediate stages, and undergoes another wave of change at
the late stage. Epigenomic profiling showed that active
histone marks such as H3K27ac and H3K4me1 switch early, in
positive correlation with early-stage gene expression changes,
while the repressive histone mark H3K27me3 changes
dramatically only at the late stage, which may contribute to
up-regulation of the late-induced hepatic genes. The
early-induced genes are accessible for binding by the key
transcription factor Foxa3 at early times, while the
late-induced hepatic genes are located in repressive chromatin
regions enriched in H3K27me3 or H3K9me3 modifications, which
indicates that H3K27me3 may act as an epigenetic barrier to
hepatic lineage conversion. In addition, we found that Foxa3
prefers to bind enhancer regions to regulate late-induced
genes, whereas it is more enriched at promoter regions of
early-induced genes, which may lead to the different
activation patterns of these two groups of genes. Taken
together, these results suggest that cell-type-specific
transcription factors need to cooperate with epigenetic
regulators to overcome epigenetic barriers, thus activating
cell-type-specific enhancers and inducing transcription of
target genes.

Fig.3 Dissecting the chromatin regulation during
induced hepatic lineage conversion.


Future Perspective 


Epigenomic profiling has aided the discovery of genome-wide
coordinated chromatin changes that occur during development
and disease. Through bioinformatics analysis, numerous
candidate sites have been predicted to be cis-regulatory
elements such as enhancers. Dynamic epigenetic changes at
these sites are a driving force that switches genetic programs
on or off in many biological processes. In
trans-differentiation, we have found that H3K27me3 and H3K9me3
modifications are two major epigenetic barriers to hepatic
lineage conversion. To further explore the underlying
mechanism, we will investigate how specific transcription
factors interact with epigenetic regulators. By comparing
their genome-wide binding patterns with the dynamic changes
of the corresponding histone modifications, we expect to find
candidate epigenetic regulators that could play key roles in
cell fate conversion. We will then investigate whether
modulating these epigenetic regulators, by knockdown assays or
small-molecule inhibitors, can promote trans-differentiation,
and examine the genome-wide changes of the corresponding
histone modifications. Once the dynamic change of a given
histone mark, its target genes and its specific epigenetic
regulator are clear, it should be possible to shift the
histone modification pattern towards one that facilitates the
target-cell-specific gene program at the right time by
modulating the expression level of the epigenetic regulator.


Emerging evidence suggests that non-coding RNAs also play an
important role in regulating epigenetic modifications by
recruiting epigenetic regulators such as histone
methyltransferases or demethylases to their binding sites. In
future, we will investigate the functional roles of non-coding
RNAs in epigenetic regulation during cell fate conversion. We
have selected a number of candidate RNAs based on their
nuclear localization and cell-type-specific expression
patterns. We will examine the genome-wide binding sites of
these RNAs by the ChIRP (Chromatin Isolation by RNA
Purification) assay and compare their binding patterns with
those of different histone modifications, transcription
factors and epigenetic regulators. By integrating these
epigenomic data, we will be able to predict the functions of
candidate non-coding RNAs. Further biochemical and knockdown
assays will be performed to validate these functions.
Together, these genome-wide analyses will help us understand
more deeply the molecular mechanisms underlying dynamic
changes of histone modifications.


Higher-order chromatin structure 


Current State of Research 


Emerging evidence suggests that higher-order chromatin
structure plays an important part in regulating genome
functions such as transcription, DNA replication and DNA
repair. Regulatory elements can act over large genomic
distances to modulate gene expression through the formation of
chromatin loops that physically link the elements with their
target genes. However, the mechanisms by which chromatin
interactions are formed and maintained during development
remain to be elucidated. Chromosome Conformation Capture (3C)
technology and its derivatives have been widely used to detect
chromatin interactions and have greatly contributed to our
understanding of the relationship between genome organization
and genome function. These techniques have uncovered general
features of genome organization, from chromosome territories
and chromatin compartments to topologically associating
domains (TADs), insulated domains and chromatin loops.


Using a high-resolution Hi-C assay, we investigated the
higher-order chromatin structure of Plasmodium falciparum, the
parasite that causes malaria. At the global level, the whole
genome displays strong centromere-to-centromere and
telomere-to-telomere interactions, indicating a very
specialized genome organization in this organism. More
interestingly, the detailed chromatin interaction map reveals
intensive chromatin interactions between genes of the same
gene family, the var gene family. The var gene family consists
of 60 var genes, each encoding a variant antigen that is
expressed on the surface of parasite-infected red blood cells
and is a critical virulence factor for malaria. During
infection, only one var gene is activated while the other var
genes are silenced, an immune evasion mechanism that avoids
the host's antibody responses. By further analysis, we
identified a novel type of interaction domain that includes
most var genes and telomere regions. This domain is highly
enriched in H3K36me3, H4K20me3 and H3K9me3 modifications, and
genes within it are in a transcriptionally repressed state.
Moreover, deletion of PfSETvs, the methyltransferase for
H3K36me3, leads to a significant decrease in chromatin
interaction frequency within this interaction domain, coupled
with a reduction in H3K36me3 level and activation of most var
genes in the domain. This result suggests that a specific
epigenetic regulator can regulate gene transcription by
modulating chromatin interactions in the interaction domain.
To further explore the underlying mechanisms, we used the
ChIRP assay to map the binding sites of a specific non-coding
RNA, the antisense RNA of the uniquely expressed var gene. The
result shows that this antisense RNA binds to almost every var
gene and telomere region in the interaction domain, suggesting
that the high-frequency chromatin interactions could be
mediated by both the non-coding RNA and the histone
methyltransferase.


The rapid development of high-throughput chromatin interaction
assays such as Hi-C and the fast accumulation of large amounts
of data raise great challenges for data analysis and
interpretation. So far, many bioinformatics tools have been
developed for analyzing Hi-C data, such as HOMER,
HiC-inspector, HiCUP, HiCdat, HiCbox, the hiclib package and
HiC-Pro. However, the majority of these tools cover only a
particular step of the whole Hi-C workflow. We have therefore
developed a new Hi-C data processing tool named hicMap, which
is fully command-line based, installation-free and highly
parallelized, allowing biologists to finish Hi-C data
processing with a single command in a few hours. In addition,
we provide a utility kit named hicMapTools, which contains all
functions in hicMap and many other popular Hi-C tools, to help
users customize their analysis step by step and perform
post-processing.


At the mapping step, we use an iterative shifting mapping
strategy based on a fixed short length instead of variable
read lengths, because Bowtie performs much better on short
reads and in most cases the full length is not needed for a
read to be uniquely mapped. At the pairing step, we replace
all read names with numeric IDs during the mapping stage, and
the mapped reads are then binned into small pieces according
to numeric ranges; hence pairing can be done without sorting
the mapping results and can be parallelized.
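
The two ideas in this paragraph can be sketched in a few lines. This is not the hicMap implementation, whose internals are not described here: the window length, the shift step, the bin size, the function names and the toy aligner (a plain substring search standing in for Bowtie) are all assumptions chosen for illustration.

```python
from collections import defaultdict

def iterative_shift_map(read, align_unique, window=25, step=5):
    """Try fixed-length windows along the read until one maps uniquely.

    `align_unique` is a hypothetical callable that returns a genome
    position for a uniquely mapping sequence, or None otherwise."""
    for offset in range(0, len(read) - window + 1, step):
        hit = align_unique(read[offset:offset + window])
        if hit is not None:
            return offset, hit
    return None

def pair_reads_by_id_bins(mate1, mate2, bin_size=1000):
    """Pair mates by numeric read ID without sorting the alignments.

    Each record is (read_id, chrom, pos). Records are dropped into bins
    by ID range; every bin can then be paired independently, which is
    what makes the step easy to parallelize."""
    bins1, bins2 = defaultdict(dict), defaultdict(dict)
    for read_id, chrom, pos in mate1:
        bins1[read_id // bin_size][read_id] = (chrom, pos)
    for read_id, chrom, pos in mate2:
        bins2[read_id // bin_size][read_id] = (chrom, pos)
    pairs = []
    for b in bins1.keys() & bins2.keys():
        left, right = bins1[b], bins2[b]
        for read_id in left.keys() & right.keys():
            pairs.append((read_id, left[read_id], right[read_id]))
    return pairs

def make_toy_aligner(genome):
    """Toy stand-in for a short-read aligner: exact, unique substring match."""
    def align_unique(seq):
        first = genome.find(seq)
        if first == -1 or genome.find(seq, first + 1) != -1:
            return None
        return first
    return align_unique

aligner = make_toy_aligner("ACACACACACGTTGCATGCA")
mapped = iterative_shift_map("ACACACGTTG", aligner, window=6, step=2)

mate1 = [(3, "chr1", 100), (1, "chr2", 50), (2005, "chr1", 7)]
mate2 = [(1, "chr3", 9), (2005, "chr1", 900), (4, "chr1", 1)]
pairs = sorted(pair_reads_by_id_bins(mate1, mate2))
```

In the toy read, the first window is repetitive and fails to map uniquely, while the window shifted by two bases succeeds. Reads 3 and 4 have no mapped mate and are dropped during pairing.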


Future Perspective 


Chromatin compartments, TADs and chromatin loops are different
layers of higher-order chromatin structure. How these
three-dimensional structures form, are maintained, or change
dynamically in development or disease remains elusive. To
address this question, we plan to study chromatin interactions
during cell trans-differentiation and in disease.


In the trans-differentiation process, it will be interesting
to know how the forcibly expressed transcription factors
activate cell-specific enhancers and how they are involved in
establishing long-range enhancer-promoter interactions and
thereby regulate transcription of target genes. We are also
interested in investigating how architectural proteins such as
CTCF, the cohesin complex, nuclear lamins and mediators
interact with the transcription factors to regulate chromatin
interactions. We have generated several cell lines in which
the expression of these proteins is knocked down or in which
these proteins carry specific point mutations. In future, we
will investigate how large domains such as chromatin
compartments and TADs change in the knockdown/mutant cells
using the Hi-C assay. To profile the detailed structure of
chromatin loops, especially the interactions between enhancers
and promoters, we need higher-resolution maps of chromatin
interactions. There are two mapping strategies that we plan to
use in future: capture Hi-C, which uses pre-designed
oligonucleotides to enrich chromatin interactions associated
with specific regions, and Tnp-HiC, which detects only
chromatin interactions between DNase hypersensitive sites such
as active enhancers and promoters by partially digesting
chromatin DNA. By these means, we will be able to obtain
detailed chromatin interaction maps and further investigate
how cell-type-specific enhancers are activated, connected with
the promoters of their target genes and finally activate
target gene transcription. In addition, by integrating Hi-C
data with ChIP-seq data for transcription factor binding and
histone modifications, we will examine which transcription
factors or epigenetic regulators are highly involved in, or
even mediate, long-range chromatin interactions between
regulatory elements.


In disease processes such as cancer, we will study how
higher-order chromatin structure changes in the disease state
and how these structural changes affect the normal
transcriptional program. A high mutation frequency in
architectural protein genes such as CTCF and cohesin has been
found in multiple types of cancer, suggesting that
higher-order chromatin structure plays an important role in
maintaining normal transcriptional circuits. Using the Hi-C
assay, we will identify regions that show significant
structural change at the TAD level and investigate the
underlying mechanisms. In addition, expanding GWAS data have
led to the identification of a huge number of human
disease-associated variants. As many risk-associated variants
are located in non-coding regions, it is challenging to assign
function to them. To address this question, we will use
high-resolution Hi-C to profile the detailed structure of
chromatin interactions in normal tissue and disease samples.
By these means, we will be able to link disease-associated
non-coding variants to key target genes and thus functionally
annotate the genetic variation.
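
Linking a non-coding variant to a candidate target gene through a chromatin loop amounts to an interval lookup: the variant must fall in one loop anchor and a gene promoter must overlap the other. The sketch below assumes half-open coordinates and uses made-up variant identifiers, loop anchors and gene names; it illustrates the idea only and is not the group's pipeline.

```python
def contains(region, chrom, pos):
    """True if the position lies inside the half-open region (chrom, start, end)."""
    r_chrom, start, end = region
    return r_chrom == chrom and start <= pos < end

def overlaps(a, b):
    """True if two (chrom, start, end) regions share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def link_variants_to_genes(variants, loops, promoters):
    """Map each variant to genes whose promoter sits at the other loop anchor."""
    links = {}
    for var_id, (chrom, pos) in variants.items():
        genes = set()
        for anchor_a, anchor_b in loops:
            # A loop is symmetric, so test the variant against both anchors.
            for near, far in ((anchor_a, anchor_b), (anchor_b, anchor_a)):
                if contains(near, chrom, pos):
                    genes.update(
                        g for g, prom in promoters.items() if overlaps(far, prom)
                    )
        links[var_id] = sorted(genes)
    return links

# Hypothetical inputs.
variants = {"variant_1": ("chr1", 1500), "variant_2": ("chr1", 90000)}
loops = [(("chr1", 1000, 2000), ("chr1", 50000, 51000))]
promoters = {"GENE_A": ("chr1", 50500, 50700), "GENE_B": ("chr1", 70000, 70200)}

links = link_variants_to_genes(variants, loops, promoters)
```

Here `variant_1` is linked to `GENE_A` through the loop, while `variant_2` lies outside every anchor and stays unassigned. With genome-scale data, an interval tree or sorted-interval search would replace the nested loops.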


Another important factor that can affect chromatin
organization is non-coding RNA. Some non-coding RNAs have been
reported to regulate transcription by mediating chromatin
loops between regulatory elements. It will therefore be
interesting to identify non-coding RNAs that regulate
chromatin structure. In particular, we will examine a group of
candidate RNAs that are located in the cell nucleus, have
tissue-specific expression patterns, and interact with
architectural proteins or certain transcription factors. We
will investigate the genome-wide binding patterns of specific
candidate RNAs by ChIRP-seq and check whether these RNA
binding sites are enriched at chromatin interacting regions.
Finally, we will knock down specific candidate RNAs and
examine how chromatin interactions change at their binding
sites and how these changes affect transcription of related
genes.
Research
Overview


With the advent of high-throughput deep-sequencing
technologies and computational approaches, whole-transcriptome
analyses have revealed that the vast majority of genomic
sequence is transcribed into a diverse range of protein-coding
RNAs and non-coding RNAs (ncRNAs), and that the landscape of
the eukaryotic transcriptome is far more complex than
originally imagined. In addition, accumulating lines of
evidence suggest that RNA-based regulatory mechanisms play
significant roles in numerous biological phenomena, from
controlling the architecture of whole chromosomes and
regulating gene expression to mediating the translation of
genetic information into protein sequences. In the last
several years, our group has applied state-of-the-art
next-generation deep-sequencing technologies together with
diverse biochemical methods and specific computational
strategies to identify, profile and, most importantly,
characterize new types of long noncoding RNAs (lncRNAs),
especially covalently closed circular RNAs (circRNAs) produced
by back-splicing. Another focus of the lab is to identify new
RNA editing events by developing novel computational
pipelines/algorithms and to decipher novel functions of the
RNA editing enzymes, the ADARs. These findings shed new light
on the diversity and complexity of the transcriptome and
suggest previously under-appreciated regulatory roles of RNAs.


|. Identification and characterization of 
new types of IncRNAs 


Although only about 2% of the human 
genome encodes protein sequences, recent 
advances in genome-wide analyses have 
revealed that the majority of the human 
genome is transcribed, largely from noncoding 
segments that used to be considered as “junk 
sequences” or “dark matter”. Besides well- 
characterized housekeeping noncoding RNAs 
(such as tRNA, rRNA, snRNA and snoRNA) and 
small regulatory ncRNAs, the transcriptome 
has become even more complex with 
pervasively transcribed long noncoding RNAs 
(lncRNAs, at least 200 nt long).
1. Biogenesis and function of intron-derived sno-lncRNAs


We pioneered the development of poly(A)- RNA-seq and used it to reveal
novel types of lncRNAs. One type that we discovered is the snoRNA-related
lncRNAs (sno-lncRNAs), whose ends correspond to the positions of intronic
snoRNAs (Yin, et al., 2012, Molecular Cell). We found that a group of five
sno-lncRNAs is highly expressed from the imprinted Prader-Willi syndrome
(PWS) region on human chromosome 15. However, sno-lncRNAs from other
regions of the human genome or from other genomes had not yet been
documented. To further explore sno-lncRNA candidates, we developed a custom
computational pipeline that predicts sno-lncRNAs by integrating snoRNA
annotations with poly(A)-/ribo- RNA-seq datasets from human, rhesus and
mouse. By applying this pipeline, we systematically annotated sno-lncRNAs
expressed in all three species (Zhang, et al., 2014, BMC Genomics). We
revealed that although the snoRNA ends of such molecules are highly
conserved, PWS region sno-lncRNAs are highly expressed in human and rhesus
but absent in mouse (Figure 1). The absence of PWS region sno-lncRNAs in
mouse suggests a possible reason for the failure of the current mouse model
to fully recapitulate the pathological features of human PWS. Only one
mouse sno-lncRNA, in the RPL13A region, was identified from the limited
mouse datasets available, and the snoRNAs in this region have themselves
been suggested to be involved in lipotoxicity in mouse. Interestingly, the
RPL13A region sno-lncRNA is barely detectable in human. Our results also
demonstrated that the formation of sno-lncRNAs often requires alternative
splicing within their parent genes (Figure 2). This study thus further
indicated a complex regulatory network of coding and noncoding parts of the
mammalian genome.
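The prediction step described above can be illustrated with a minimal Python sketch. It is not the published pipeline: the function name `candidate_sno_lncrnas` and the tuple layouts are hypothetical, and the sketch assumes only what the text states, namely that a sno-lncRNA's ends coincide with intronic snoRNAs. It proposes the span between adjacent snoRNAs in the same intron as a candidate; supporting poly(A)-/ribo- RNA-seq signal would still have to be checked separately.

```python
def candidate_sno_lncrnas(introns, snornas):
    """Propose sno-lncRNA candidates flanked by two snoRNAs in one intron.

    introns: list of (chrom, start, end) tuples            (hypothetical layout)
    snornas: list of (chrom, start, end, name) tuples      (hypothetical layout)
    Returns a list of (chrom, start, end, left_name, right_name).
    """
    candidates = []
    for chrom, i_start, i_end in introns:
        # snoRNAs that lie entirely inside this intron, ordered by position
        inside = sorted(
            (s for s in snornas
             if s[0] == chrom and s[1] >= i_start and s[2] <= i_end),
            key=lambda s: s[1],
        )
        # each adjacent snoRNA pair defines one candidate region
        for left, right in zip(inside, inside[1:]):
            candidates.append((chrom, left[1], right[2], left[3], right[3]))
    return candidates


introns = [("chr15", 1000, 9000)]
snornas = [
    ("chr15", 2000, 2100, "snoA"),
    ("chr15", 5000, 5100, "snoB"),
    ("chr15", 9500, 9600, "snoC"),  # outside the intron, ignored
]
print(candidate_sno_lncrnas(introns, snornas))
```

The coordinates and snoRNA names in the usage lines are invented for illustration only.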


Figure 1. Unique expression of PWS region sno-lncRNAs across species.
Normalized read densities of poly(A)-/ribo- RNA-seq (red) and poly(A)+
RNA-seq (black) from human (A), rhesus (B) and mouse (C) (Adapted from
Zhang, et al., BMC Genomics, 2014).

Figure 2. Species-specific RPL13A region sno-lncRNA is derived from
species-specific alternatively spliced rpl13a transcripts. (A) De novo
transcript assembly revealed previously uncharacterized alternatively
spliced rpl13a transcripts. (B) No such alternatively spliced rpl13a
transcripts were found in human ESC H9 and HeLa cells, either by de novo
assembly (h9_pAplus_cufflinks) or in known annotation (blue lines).
(C) Validation of alternatively spliced rpl13a transcripts by RT-PCR.
(D) The alternative splicing of RPL13A causes amino acid changes at the
C-terminus of the RPL13A protein (Adapted from Zhang, et al., BMC
Genomics, 2014).


2. Biogenesis of exon-derived circRNAs

Another type of long noncoding RNA that we have focused on is circular RNA.
Circular RNA molecules were first considered to be byproducts of splicing
errors. Recently, circular RNAs from the human INK4a/ARF and CDR1 loci were
identified and suggested to affect human atherosclerosis risk or regulate
mRNA expression, thus shedding new light on the physiological roles of
circular RNAs. With the advent of high-throughput sequencing of
nonpolyadenylated RNA transcripts, thousands of circular RNAs from
back-spliced exons were successfully identified in multiple human cell
lines. However, the detailed mechanism of circular RNA biogenesis remained
elusive.


To address this question, we developed a combined strategy (Figure 3,
CIRCexplorer) that takes into account the rearranged exon ordering of
back-splicing events to identify junction reads from back-spliced exons and
to systematically characterize circular RNAs from H9 human embryonic stem
cells (hESCs) (Zhang, et al., 2014, Cell). This computational pipeline is
efficient, memory-economical, easy to access and user-friendly. It allowed
us to detect 2,119 and 9,639 exonic circular RNA candidates with at least
one back-spliced junction read in poly(A)- or poly(A)-/RNase R RNA-seq,
respectively. Importantly, many of these circular RNAs were confirmed by
divergent PCR to be processed from back-spliced exons. In addition, we
identified a number of loci at which multiple exon circularization events
are produced from a single gene and named this phenomenon alternative
circularization (AC). The existence of alternative circularization suggests
yet another layer of gene expression regulation at the level of circular
RNA formation.
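The idea of exploiting rearranged exon ordering can be illustrated with a minimal Python sketch. It is not the CIRCexplorer implementation: the `Segment` type and the `is_back_spliced` function are hypothetical names, and the sketch assumes that an aligner has already split a read into two segments listed in transcript (5' to 3') order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One aligned piece of a split read (hypothetical structure)."""
    chrom: str
    strand: str  # '+' or '-'
    start: int   # genomic start, 0-based
    end: int     # genomic end, exclusive


def is_back_spliced(first: Segment, second: Segment) -> bool:
    """Return True if two consecutive segments join in reversed genomic order.

    `first` precedes `second` in the transcript. In canonical splicing the
    second segment lies downstream of the first; in back-splicing it lies
    upstream. "Downstream" means higher coordinates on the '+' strand and
    lower coordinates on the '-' strand.
    """
    if first.chrom != second.chrom or first.strand != second.strand:
        return False  # not a same-gene junction; out of scope for this sketch
    if first.strand == "+":
        return second.end <= first.start
    return second.start >= first.end


# A '+' strand read whose first part maps to a later exon and whose second
# part maps to an earlier exon is a back-splice junction candidate.
print(is_back_spliced(Segment("chr1", "+", 300, 400), Segment("chr1", "+", 100, 200)))
```

A real caller would additionally require the breakpoints to coincide with annotated or canonical splice sites; that filtering is omitted here.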


Investigation of genomic features revealed that most exonic circular RNAs
contain multiple exons, most commonly two to three. Interestingly, exons
from circular RNAs with only one circularized exon were much longer than
those from circular RNAs with multiple circularized exons, indicating that
processing may prefer a certain length to maximize exon circularization.
Furthermore, the flanking introns of circularized exons were much longer
(~5-fold) than randomly selected introns, which could introduce more Alu
elements to promote exon circularization. Together, these investigations
provided valuable clues to the mechanisms underlying exon circularization.

Figure 3. A computational pipeline for back-spliced junction read calling
to accurately annotate circular RNAs (Adapted from
https://github.com/YangLab/CIRCexplorer).


Circular RNAs are derived from back-splicing of Pol II transcripts,
alongside their canonically spliced linear mRNA counterparts. Back-splicing
requires the canonical spliceosomal machinery and is modulated by both cis-
and trans-regulators. Unlike canonical splicing, which joins an upstream 5'
splice (donor) site with a downstream 3' splice (acceptor) site in
sequential order to produce a linear RNA, back-splicing occurs in a
reversed orientation that links a downstream 5' splice (donor) site to an
upstream 3' splice (acceptor) site to yield a circRNA. These known features
of circRNA biogenesis suggest that circRNA processing is, in principle,
linked to transcription and pre-mRNA splicing. Previous studies revealed
that final mRNA levels reflect a balance between production and
degradation. This may be the case for circRNAs as well. We applied
metabolic tagging of newly transcribed RNAs with 4-thiouridine (4sU) to
capture nascent circRNAs and developed computational algorithms to
calculate circRNA-processing kinetics globally in human embryonic carcinoma
PA1 cells. This comprehensive dataset allowed us to quantitatively measure
and compare parental gene transcription elongation, pre-mRNA splicing, and
circRNA back-splicing at individual gene loci across a time course lasting
16 hr. We gained several previously unknown insights into circRNA
biogenesis by analyzing the kinetics of nascent circRNA processing.
Moreover, investigation of nascent circRNA processing in undifferentiated
human embryonic stem cells (hESCs) and their differentiated forebrain (FB)
neuron progenitor cells further revealed how abundant and dynamic
expression of circRNAs is achieved upon neuronal differentiation.


Using 4sUDRB-seq, we collected datasets from 4sU labeling for 10 and 15 min
to measure transcription elongation rates (TERs) with a newly developed
computational pipeline. We found that the average length of
nascent-circRNA-producing genes was significantly greater than that of
non-circRNA-producing genes; such genes therefore require a longer time to
complete their transcription. We therefore prolonged the 4sU incubation of
PA1 cells to 30, 60 and 120 min, and even to 4 and 16 hr, to identify as
many circRNAs as possible and to characterize the kinetics of circRNA
processing and decay during transcription. Together, we generated
rRNA-depleted 4sUDRB-seq datasets from PA1 cells, hESC H9 cells, and
H9-differentiated FB cells over a wide time course, which allow us to
capture newly transcribed circRNAs from long genes and to study the
coupling of circRNA processing with transcription and splicing.

We carried out detailed analyses of the 4sUDRB-seq dataset in PA1 cells and
detected only a dozen nascent circRNAs at 10 and 15 min. With longer 4sU
labeling, hundreds of nascent circRNAs were identified within 120 min after
transcription initiation. Interestingly, the efficiency of back-splicing
was still generally low compared with the adjacent canonical splicing
events at these time points. As circRNA biogenesis depends on the
spliceosomal machinery and is thus linked to transcription, we compared
TERs between gene groups. Both our newly developed TERate pipeline and a
published method revealed that the average TER of nascent
circRNA-producing genes was higher than that of non-circRNA genes, for
example 2.9 kb/min versus 2.29 kb/min as calculated by the TERate pipeline.
This analysis suggested that circRNA formation correlates with fast Pol II
elongation. To further confirm this observation, we next constructed three
cell lines transfected with either WT Pol II or one of two mutant versions.
The Pol II mutants carried either R749H or E1126G, which decelerate or
accelerate transcription, respectively. We detected lower levels of nascent
circRNAs with reduced TER and higher levels with increased TER.

We also compared the relative abundance of nascent circRNAs with
steady-state levels in H9 cells and FB neurons and identified, for the
first time, nascent circRNAs with increased expression upon FB neuron
differentiation. Although the total number and steady-state expression
level of linear mRNAs remained largely unchanged, the steady-state levels
of circRNAs significantly increased in both total number and expression
upon neuronal differentiation. Thus, the synthesis of circRNAs from rapidly
transcribed circRNA-producing genes, together with their accumulation,
leads to the detection of upregulated steady-state circRNAs in neurons,
which have slow division rates.
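An elongation rate of the kind reported above can be estimated, in the simplest case, from how far the transcription wavefront advances between two labeling time points. The Python sketch below shows only that arithmetic. It is not the TERate pipeline: the function name `elongation_rate_kb_per_min` is hypothetical, and the wavefront positions are assumed to have been determined beforehand from read coverage.

```python
def elongation_rate_kb_per_min(front_t1_bp, front_t2_bp, t1_min, t2_min):
    """Estimate a transcription elongation rate from two wavefront positions.

    front_t1_bp, front_t2_bp: distance (bp) from the transcription start site
        reached by nascent-RNA signal at the two time points (assumed inputs).
    t1_min, t2_min: labeling times in minutes, with t2_min > t1_min.
    """
    if t2_min <= t1_min:
        raise ValueError("t2_min must be later than t1_min")
    if front_t2_bp < front_t1_bp:
        raise ValueError("the wavefront cannot move backwards")
    return (front_t2_bp - front_t1_bp) / 1000.0 / (t2_min - t1_min)


# Illustrative numbers only: a wavefront advancing 14.5 kb between the
# 10 min and 15 min time points corresponds to 2.9 kb/min.
print(elongation_rate_kb_per_min(20_000, 34_500, 10, 15))
```

The wavefront positions in the usage line are invented and chosen only so that the result matches the order of magnitude quoted in the text.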


Recent research into circRNA biogenesis has shown that back-splicing is
catalyzed, though inefficiently, by the canonical spliceosomal machinery
and modulated by both cis-elements and trans-factors. The identification of
back-splice junctions is therefore crucial for annotating circRNAs. Because
they are covalently closed and have no open ends, circRNAs are largely
missed by polyadenylated transcriptome profiling but can be captured by RNA
deep sequencing (RNA-seq) of nonpolyadenylated RNAs. Our group has revealed
that a single gene locus can produce multiple circRNAs through alternative
back-splicing (circularization), by a mechanism associated with competition
between putative RNA pairs across the introns that bracket the
circle-forming exons. Theoretically, there are two types of alternative
back-splicing. One is alternative 5' back-splicing, in which two or more
downstream 5' back-splice sites alternatively link to the same upstream 3'
back-splice site in a reversed orientation. The other is alternative 3'
back-splicing, in which two or more upstream 3' back-splice sites
alternatively link to the same downstream 5' back-splice site in a reversed
orientation. Alternative back-splicing clearly further expands the
complexity of circRNA formation; however, its detailed mechanism is largely
unknown. Just as most multiexonic genes undergo alternative splicing to
generate multiple (linear) mRNAs, alternative splicing can also expand the
diversity of circRNAs. For example, some cassette exons are more favorably
included in circRNAs than in linear mRNAs, and some introns are retained in
circRNAs through alternative splicing.
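The two types of alternative back-splicing defined above can be expressed as a simple grouping of back-splice junctions. The Python sketch below is illustrative only and is not taken from CIRCexplorer2: the function name and the `(chrom, strand, start, end)` junction layout are assumptions. On the '+' strand the upstream 3' back-splice site is taken to be `start` and the downstream 5' back-splice site `end`; on the '-' strand the roles are swapped.

```python
from collections import defaultdict


def classify_alternative_back_splicing(circs):
    """Group circRNA junctions that share one back-splice site.

    circs: iterable of (chrom, strand, start, end) junction coordinates
           (hypothetical layout).
    Returns (alt5, alt3):
      alt5 maps a shared 3' site to its two or more alternative 5' sites,
      alt3 maps a shared 5' site to its two or more alternative 3' sites.
    """
    five_by_three = defaultdict(set)  # 3' site -> 5' sites linked to it
    three_by_five = defaultdict(set)  # 5' site -> 3' sites linked to it
    for chrom, strand, start, end in circs:
        three_site, five_site = (start, end) if strand == "+" else (end, start)
        five_by_three[(chrom, strand, three_site)].add(five_site)
        three_by_five[(chrom, strand, five_site)].add(three_site)
    alt5 = {k: sorted(v) for k, v in five_by_three.items() if len(v) > 1}
    alt3 = {k: sorted(v) for k, v in three_by_five.items() if len(v) > 1}
    return alt5, alt3


circs = [
    ("chr1", "+", 100, 500),
    ("chr1", "+", 100, 800),  # same 3' site as above, different 5' site
    ("chr1", "+", 300, 800),  # same 5' site as above, different 3' site
]
alt5, alt3 = classify_alternative_back_splicing(circs)
print(alt5)
print(alt3)
```

The example coordinates are invented; a single junction can appear in both groups, as the middle entry does here.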


We applied CIRCexplorer2, an upgraded computational pipeline, to identify
both alternative back-splicing and alternative splicing events in circRNAs
from various poly(A)- RNA-seq datasets. Through comparison with parallel
poly(A)+ RNA-seq datasets from the same cell lines, thousands of
alternatively back-spliced exons and circRNA-predominant alternatively
spliced exons, including previously unannotated ones, were identified in
circRNAs, with variable expression patterns across different poly(A)- and
poly(A)-/RNase R RNA-seq datasets. Together, the diverse landscape of
alternative back-splicing and alternative splicing in circRNAs provides a
valuable resource for depicting the complexity of circRNA formation.


With CIRCexplorer2, more than 10,000 circRNAs were identified, and their
downstream 5' back-splice sites were accordingly annotated. Of the highly
expressed circRNAs, with mapped back-splice junction reads >= 0.1 RPM
(mapped back-splice junction Reads Per Million mapped reads), up to 30%
were alternatively back-spliced, which suggests that alternative
back-splicing is widespread among circRNAs. A computational analysis of
orientation-opposite complementary sequences revealed that more than 70% of
the highly expressed circRNAs (junction reads >= 0.1 RPM) with alternative
back-splice site selection contained paired intronic complementary
sequences flanking both the proximal and the distal 5'/3' back-splice
sites; in comparison, ~20% of randomly selected non-alternative
back-splicing events were flanked by paired intronic complementary
sequences. Clusters of proximal and distal RNA pairs across introns were
seen at many gene loci, and the strong competition between these potential
RNA pairs correlated with the detected alternative back-splicing events.
These results, which were further validated by experiments, provide yet
another line of evidence that cis-elements can significantly affect the
biogenesis of circRNAs.
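The RPM threshold used above follows directly from its definition as back-splice junction reads per million mapped reads. The Python sketch below shows the calculation and the >= 0.1 RPM filter; the function names and the dictionary input are hypothetical and serve only as an illustration.

```python
def junction_rpm(junction_reads, total_mapped_reads):
    """Back-splice junction Reads Per Million mapped reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return junction_reads * 1_000_000 / total_mapped_reads


def highly_expressed(circ_counts, total_mapped_reads, min_rpm=0.1):
    """Keep circRNAs whose junction-read RPM reaches the threshold.

    circ_counts: dict mapping a circRNA identifier to its junction-read count
                 (hypothetical input layout).
    """
    selected = {}
    for name, count in circ_counts.items():
        rpm = junction_rpm(count, total_mapped_reads)
        if rpm >= min_rpm:
            selected[name] = rpm
    return selected


# With 50 million mapped reads, 10 junction reads give 0.2 RPM (kept)
# and 2 junction reads give 0.04 RPM (filtered out).
print(highly_expressed({"circA": 10, "circB": 2}, 50_000_000))
```

The identifiers and read counts are invented for illustration.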


Our computational pipeline CIRCexplorer2 performed de novo assembly from
unmapped reads from poly(A)- and/or poly(A)-/RNase R RNA-seq datasets and
revealed that many alternative 5'/3' back-splice sites from previously
unannotated exons (non-RefSeq, non-UCSC Known Genes, or non-Ensembl) were
predominantly detected in circRNAs. We further confirmed some of these
novel back-splice sites by Sanger sequencing after RT-PCR amplification and
by Northern blot analysis. With the upgraded CIRCexplorer2 pipeline, we
identified all four basic types of alternative splicing events in the
examined cell lines and further quantified the extent of each type: PSI
(Percent Spliced In) for cassette exon selection, PIR (Percent Intron
Retention) for intron retention, and PSU (Percent Splice-site Usage) for
alternative 5'/3' splice site selection. All splicing events that were
selected more frequently in circRNAs than in their linear cognates were
counted. These analyses revealed that 20%-30% of the
circRNA-specific/-predominant alternative splicing events could be detected
in multiple examined cell lines. The circRNA-predominant cassette exons
were generally less conserved than constitutive exons and cassette exons
identified in linear mRNAs.
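A measure such as PSI is commonly computed as the fraction of reads that support inclusion of an exon among all reads that support either inclusion or exclusion. The Python sketch below uses that common read-count definition to compare a cassette exon between circRNAs and their linear cognates. The exact normalization used by CIRCexplorer2 is not given in this report, so the formula and the function names here are assumptions.

```python
def percent_spliced_in(inclusion_reads, exclusion_reads):
    """PSI as a percentage, or None when no informative reads exist."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return 100.0 * inclusion_reads / total


def is_circ_predominant(circ_incl, circ_excl, linear_incl, linear_excl):
    """True if the exon is included more often in circRNAs than in linear RNA."""
    circ_psi = percent_spliced_in(circ_incl, circ_excl)
    linear_psi = percent_spliced_in(linear_incl, linear_excl)
    if circ_psi is None or linear_psi is None:
        return False  # not enough evidence on one side to compare
    return circ_psi > linear_psi


# Illustrative counts: 75% inclusion in circRNAs versus 10% in linear mRNAs.
print(percent_spliced_in(30, 10), percent_spliced_in(5, 45))
print(is_circ_predominant(30, 10, 5, 45))
```

PIR and PSU could be computed in the same ratio form, with intron-retaining or splice-site-supporting reads in place of inclusion reads.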


3. Function/Biological impact of circRNAs 


One aspect of circRNA function that we have focused on is circRNA-derived
pseudogenes. Circular RNAs (circRNAs) from pre-mRNA back-splicing have
recently been rediscovered genome-wide in metazoans by taking advantage of
RNA deep sequencing (RNA-seq) of non-polyadenylated transcriptomes and
specific computational pipelines that can retrieve non-colinear
back-splicing junction reads. Despite being inefficiently processed and
mostly expressed at low levels, some circRNAs are derived from genomic loci
associated with human diseases, and increasing lines of evidence have
suggested their potential roles in transcriptional, post-transcriptional
and translational regulation. However, it was not known whether these
exceptionally stable circRNAs can be retrotranscribed and ultimately
inserted back into the host genome as processed pseudogenes.


Theoretically, a linear mRNA-derived pseudogene keeps the same sequential
(colinear) order of exons as its parent linear mRNA (Figure 4A). In
contrast, a circRNA-derived pseudogene would have an exon-exon junction in
a reversed (non-colinear) order (Figure 4B). Taking advantage of this
feature, we developed a computational pipeline (CIRCpseudo) to identify
potential circRNA-derived pseudogenes in the mouse reference genome
(Figure 4C) (Dong, et al., Cell Research, 2016). Among them, at least 33
pseudogenes are possibly derived from the same circular RNA at the RFWD2
(ring finger and WD repeat domain 2) locus (circRFWD2), with the
characteristic non-colinear back-splicing junction sequence that joins
exon 6 to exon 2 in a reversed order (Figure 4D). We refer to these 33
pseudogenes as "high-confidence circRFWD2-derived pseudogenes" owing to the
existence of the non-colinear exon 6-exon 2 junction sequence.
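The distinguishing feature described above, a junction in which a later exon is followed by an earlier one, can be checked with a few lines of Python. This is a minimal sketch and not the CIRCpseudo pipeline: the function name is hypothetical, and the input is assumed to be the list of parent-gene exon numbers in the order in which they appear along a pseudogene, already obtained from an alignment.

```python
def non_colinear_junctions(exon_order):
    """Return (upstream_exon, downstream_exon) pairs that break colinearity.

    exon_order: parent-gene exon numbers in the order they occur along the
                pseudogene sequence (assumed to come from a prior alignment).
    A colinear, linear-mRNA-derived pseudogene yields an empty list.
    """
    return [(a, b) for a, b in zip(exon_order, exon_order[1:]) if b < a]


# Linear mRNA-derived: exons stay in their original order.
print(non_colinear_junctions([2, 3, 4, 5, 6]))
# circRNA-derived: the exon 6-exon 2 junction appears in reversed order.
print(non_colinear_junctions([4, 5, 6, 2, 3]))
```

Strand handling and the alignment step itself are deliberately left out of this sketch.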


We speculated that the insertion of retrotransposed circRNAs might be
involved in gene expression regulation by altering genome structure.
Intriguingly, a CTCF/Rad21-binding site in the mouse MEL and G1E cell lines
was found to overlap exactly with a circSATB1-derived pseudogene locus
(Figure 4G). As CTCF binding can affect chromosome configuration and thus
regulate gene expression, this finding may indicate an unexpected
biological effect originating from circRNAs. Together, we demonstrated that
pseudogenes can be retrotransposed from circRNAs and, importantly,
inherited in mammalian genomes. Their existence in the genome has the
potential to reshape genome architecture by providing additional
CTCF-binding sites.


Figure 4. Characterization of circRNA-derived pseudogenes and their potential role in reshaping 
genome architecture (Adapted from Dong, et al., Cell Research, 2016). 
Future Perspective


To gain a comprehensive understanding of the complexity of the
transcriptome, we plan to study the function of lncRNAs in depth. We are
developing new computational methods, combined with diverse biochemical
experiments, to investigate the roles of these lncRNAs in cells. By
applying our methods to diverse datasets, we can continuously update our
annotation of lncRNAs. In particular, by applying them to specific cell
lines from different tissues or species, we expect to improve the
identification of tissue- or species-specific lncRNAs as marker genes and
our understanding of lncRNA evolution and conservation. These new
annotations will be collected into a database and subsequently made
available online. With the application of high-throughput sequencing
technologies and biochemical labeling methods, we are in the process of
studying the structure of lncRNAs. Moreover, with information on RNA
structure, we are decoding the RNA interaction network.


II. Novel function of ADAR in
miRNA biogenesis and neural
differentiation


Adenosine deaminases acting on RNA (ADARs) catalyze adenosine-to-inosine
RNA editing and are implicated in development and disease. ADARs interact
with several types of dsRNA substrates, including small RNA precursors.
Editing of miRNA precursors can interfere with miRNA biogenesis or alter
the target specificity of the edited mature miRNA. Besides the catalytic
activity of ADARs on miRNAs, it has been shown that ADARs can modulate the
miRNA/siRNA pathways independently of their editing activity in flies,
worms and mouse embryos. Recently, two research groups showed that ADAR1
interacts with different components of the miRNA biogenesis pathway and
exerts different effects on miRNA production. Although these studies
clearly suggested that the non-catalytic activity of ADAR1 and the miRNA
biogenesis pathway are connected at multiple levels, the underlying
mechanisms still require further investigation.


ADAR1 is an essential protein: ADAR1-deficient mice are embryonic lethal.
In addition, mutations in the human ADAR1 gene have been shown to be
associated with Aicardi-Goutières syndrome, an early-onset encephalopathy
that often results in severe and permanent neurological damage, indicating
that ADAR1 may play an important role during neural development in humans.


To elucidate the functions and mechanisms of ADAR1 in differentiation and
neural induction, we generated hESCs lacking ADAR1 and examined their
ability to differentiate into specific types of neurons, followed by
RNA-seq to systematically compare mRNA and miRNA changes between wild-type
(WT) and ADAR1-deficient cells at several differentiation time points
(Chen, et al., Cell Research, 2015). We observed that ADAR1 deficiency in
human embryonic stem cells (hESCs) significantly affects hESC
differentiation and neural induction, with widespread changes in mRNA and
miRNA expression, including upregulation of self-renewal-related miRNAs
such as the miR302s.
Global editing analyses revealed that ADAR1 editing activity contributes
little to the altered miRNA/mRNA expression in ADAR1-deficient hESCs upon
neural induction. Genome-wide iCLIP studies showed that ADAR1 binds
directly to pri-miRNAs and interferes with miRNA processing by acting as an
RNA-binding protein. Importantly, the aberrant miRNA expression and
phenotypes observed in ADAR1-depleted hESCs upon neural differentiation
could be reversed by an enzymatically inactive ADAR1 mutant, but not by an
RNA-binding-null ADAR1 mutant. These findings reveal that ADAR1, but not
its editing activity, is critical for hESC differentiation and neural
induction by regulating miRNA biogenesis via direct RNA interaction
(Figure 5).


Figure 5. A model for ADAR1 regulation in hESC self-renewal and
differentiation (Adapted from Chen, et al., Cell Research, 2015).


Future Perspective 


We found that ADAR1 binds directly to pri-miR302s, competes with DGCR8 and
thus inhibits miRNA processing. It remains unknown how ADAR1 selects its
pri-miRNA targets, and the binding sites of ADAR1 on pri-miRNAs appear to
be degenerate. For example, ADAR1 largely binds to the apical junction
regions of pri-miR302s, but it binds to the predicted DROSHA cleavage sites
of pri-miR1260s. Further studies are needed to answer this question. In
addition, no A-to-I editing was identified in primary miR302s, which
further suggests that ADAR1 binding may not always lead to adenosine
deamination. It is therefore worth studying, for instance, whether ADAR1
could bind to dsRNAs formed across introns and promote exon
circularization.


We have also observed that the expression of some miRNAs was markedly
repressed during neural differentiation in ADAR1-deficient hESCs. For
instance, let-7 family miRNAs were significantly downregulated in ADAR1 KD
samples at the late stage of neuronal differentiation. However, more
research is needed, since we cannot exclude the possibility that the
downregulation of let-7 was indirect and caused by the repression of neural
differentiation upon ADAR1 depletion.
General Information


Honors/Awards to group 
leader: 


+ 2014 Meiji Life Sciences Award, China 


+ 2015 CAS Award for Outstanding Mentors 
(PhD student Supervisors), China 


+ 2015 University of CAS-BHPB Award 
for Outstanding Mentors (PhD student 
Supervisors), China 


+ 2015 Young and middle-aged leading 
scientists, engineers and innovators by 
MoST, China 


+ 2016 Ranked as “Excellent” for “Hundred 
Talents” of 2012 by CAS, China 


Honors/Awards to group 
students: 


+ Xiao-Ou Zhang 


2014 National Scholarship for Outstanding 
Graduate Student, Ministry of Education of 
the People’s Republic of China 


2014 Pacemaker to Merit Student,
University of Chinese Academy of Sciences


2014 First author award of PICB 


2015 CAS President Award (Special Prize), 
Chinese Academy of Sciences 


2015 UCAS-BHPB Scholarship, University 
of Chinese Academy of Sciences & BHP 
Billiton 


2015 Outstanding poster second prize in 
the 8th biennial meeting of Chinese RNA 
society 


2015 Rui Wu Prize 


+ Rui Dong 


2015 Merit Student, University of Chinese 
Academy of Sciences 


2016 National Scholarship for Outstanding 
Graduate Student, Ministry of Education of 
the People’s Republic of China 


2016 Chinese Academy of Sciences & 
Zhu-Li-Yuehua Outstanding Doctoral 
Scholarship in 2016 


+ Qin Yang


2016 Merit Student, University of Chinese 
Academy of Sciences 


+ Xu-Kai Ma 


2016 University of Chinese Academy of 
Sciences outstanding students 


Publications (*co-first authors, #co-corresponding authors)


+ Chen LL# and Yang L#. ALUternative regulation for gene expression.
Trends Cell Biol, 2017, doi: 10.1016/j.tcb.2017.01.002

+ Dong R*, Ma XK*, Chen LL#, Yang L#. Increased complexity of circRNA
expression during species evolution. RNA Biol, 2017, 0(0): 1-11

+ Niu N, Xiang JF, Yang Q, Wang L, Wei Z, Chen LL, Yang L, Zou W. RNA
binding protein SAMD4 regulates skeleton development through translational
inhibition of MIG6 expression. Cell Dis, 2017, 3: 16050

+ Wu H*, Yin QF*, Luo Z*, Yao RW, Zheng CC, Zhang J, Xiang JF, Yang L,
Chen LL. Unusual processing generates SPA lncRNAs that sequester multiple
RNA binding proteins. Mol Cell, 2016, 64: 534-548 (Cover Article and Issue
Highlight). Editorial by: Li R and Fox AH. Mol Cell, 2016, 64: 435-437

+ Luo Z, Yang Q and Yang L#. RNA structure switches RBP binding. Mol Cell,
2016, 64: 219-220 (Preview)

+ Zhang XO*, Dong R*, Zhang Y*, Zhang JL, Luo Z, Zhang J, Chen LL# and
Yang L#. Diverse alternative back-splicing and alternative splicing
landscape of circular RNAs. Genome Res, 2016, 26: 1277-1287

+ Dong R, Zhang XO, Zhang Y, Ma XK, Chen LL, Yang L#. CircRNA-derived
pseudogenes. Cell Res, 2016, 26: 747-750

+ Zhang Y*, Xue W*, Li X, Zhang J, Zhang JL, Yang L#, Chen LL#. The
Biogenesis of nascent circular RNAs. Cell Rep, 2016, 15: 611-624


+ Zhong C, Xie Z*, Yin Q*, Dong R, Yang S, Wu Y, Yang L and Li J.
Parthenogenetic haploid embryonic stem cells efficiently support mouse
generation by oocyte injection. Cell Res, 2016, 26: 131-134

+ Zhang Y, Yang L# and Chen LL#. Characterization of circular RNAs.
Methods Mol Biol, 2016, 1402: 215-227 (Book chapter)

+ Yang L#. Splicing noncoding RNAs from the inside out. Wiley Interdiscip
Rev RNA, 2015, 6: 651-660 (Review)

+ Brooks AN, Duff MO, May G, Yang L, Bolisetty M, Landolin J, Wan K,
Sandler J, Celniker SE, Graveley BR and Brenner SE. Regulation of
alternative splicing in Drosophila by 56 RNA binding proteins. Genome Res,
2015, 25: 1771-1780

+ Xiang JF, Yang L and Chen LL. The long noncoding RNA regulation at the
MYC locus. Curr Opin Genet Dev, 2015, 33: 41-48 (Review)

+ Zhong C*, Yin Q*, Xie Z*, Bai M*, Dong R*, Tang W, Xing YH, Zhang H,
Yang S, Chen LL, Bartolomei MS, Ferguson-Smith A, Li D, Yang L#, Wu Y# and
Li J#. CRISPR-Cas9-mediated genetic screening in mice with haploid
embryonic stem cells carrying a guide RNA library. Cell Stem Cell, 2015,
17: 1712

+ Chen LL# and Yang L#. Gear up in circles. Mol Cell, 2015, 58: 715-717
(Preview)

+ Chen LL# and Yang L#. Regulation of circRNA biogenesis. RNA Biol, 2015,
12: 381-388 (Review)

+ Chen T*, Xiang JF*, Zhu S*, Chen S, Yin QF, Zhang XO, Zhang J, Feng H,
Dong R, Li XJ, Yang L# and Chen LL#. ADAR1 is required for differentiation
and neural induction by regulating microRNA processing in a catalytically
independent manner. Cell Res, 2015, 25: 459-476

+ Hu SB, Xiang JF, Li X, Xu Y, Xue W, Huang M, Wong CC, Sagum CA, Bedford
MT, Yang L, Cheng D# and Chen LL#. Protein arginine methyltransferase CARM1
attenuates the paraspeckle-mediated nuclear retention of mRNAs containing
IRAlus. Genes Dev, 2015, 29: 630-645

+ Yin QF*, Hu SB*, Xu YF, Yang L, Carmichael GG and Chen LL. SnoVectors
for nuclear expression of RNA. Nucleic Acids Res, 2015, 43: e5


+ Wang D*, Cai C*, Dong X, Yu QC, Zhang XO, Yang L and Zeng YA.
Identification of multipotent mammary stem cells by protein C receptor
expression. Nature, 2015, 517: 81-84

+ Yin QF, Chen LL# and Yang L#. Fractionation of Non-polyadenylated and
Ribosomal-free RNAs from Mammalian Cells. Methods Mol Biol, 2015, 1206:
69-80 (Book chapter)

+ Yang L# and Chen LL#. Microexons go big. Cell, 2014, 159: 1488-1489
(Preview)

+ Yang L# and Chen LL#. Competition of RNA splicing: line in or circle up.
Sci China Life Sci, 2014, 57: 1232-1233 (Review)

+ Dong R, Chen LL and Yang L. Research progress of circular RNA in the
post-genome era. Chinese J Cell Biol, 2014, 36: 1455-1459 (Review, in
Chinese)

+ Zhang XO*, Wang HB*, Zhang Y, Lu X, Chen LL# and Yang L#. Complementary
sequence-mediated exon circularization. Cell, 2014, 159: 134-147. Editorial
by: Vicens Q and Westhof E. Cell, 2014, 159: 13-14. Highlighted by: Nat Rev
Genet, 2014, 15: 707

+ Gerstein M, Rozowsky J, Yan KK, Wang D, Cheng C, Brown JB, Davis C,
Hillier L, Sisu C, Li JJ, Pei B, Harmanci AO, Duff MO, Djebali S, Alexander
RP, Alver BH, Auerbach R, Bell K, Bickel PJ, Boeck ME, Boley NP, Booth BW,
Cherbas L, Cherbas P, Di C, Dobin A, Drenkow J, Ewing B, Fang G, Fastuca M,
Feingold EA, Frankish A, Gao G, Good PJ, Guigo R, Hammonds A, Harrow J,
Hoskins RA, Howald C, Hu L, Huang H, Hubbard TJP, Huynh C, Jha S, Kasper D,
Kato M, Kaufman TC, Kitchen RR, Ladewig E, Lagarde J, Lai E, Leng J, Lu Z,
MacCoss M, May G, McWhirter R, Merrihew G, Miller DM, Mortazavi A, Murad R,
Oliver B, Olson S, Park PJ, Pazin MJ, Perrimon N, Pervouchine D, Reinke V,
Reymond A, Robinson G, Samsonova A, Saunders G, Schlesinger F, Sethi A,
Slack FJ, Spencer WC, Stoiber MH, Strasbourger P, Tanzer A, Thompson OA,
Wan KH, Wang G, Wang H, Watkins KL, Wen J, Wen K, Xue C, Yang L, Yip K,
Zaleski C, Zhang Y, Zheng H, Brenner SE, Graveley BR, Celniker SE, Gingeras
TR and Waterston R. Comparative analysis of the transcriptome across
distant species. Nature, 2014, 512: 445-448


« Zhang Y, Yang L and Chen LL. Life without 


A tail: new formats of long noncoding 
RNAs. Int J Biochem and Cell Biol, 2014, 54: 
338-349 (Review) 


+ Zhang XO, Yin QF, Chen LL and Yang 


L. Gene expression profiling of non- 
polyadenylated RNA-seq across species. 
Genomics Data, 2014, 2: 237-241 


+ Zhang XO%, Yin QF*, Wang HB, Zhang Y, 


Chen T, Zheng P Lu X, Chen LL and Yang 
Lİ. Species-specific alternative splicing leads 
to unique expression of sno-/ncRNAs. BMC 
Genomics, 2014, 15: 287 (Highly Accessed) 
Editorial by: BioMed Central portal-Biome 
on 17th April 2014: sno-IncRNAs: a story of 
splicing across humans, rhesus and mice 


+ Xiang JF, Yin QF, Chen T, Zhang Y, Zhang 


XO, Wang HB, Ge JH, Lu XH, Yang L and 
Chen LL. Human colorectal cancer-specific 
CCAT1-L lncRNA regulates long-range 
chromatin interactions in the MYC locus. 
Cell Res, 2014, 24:513-531 (Cover Article 
and Issue Highlight) Editorial by: Younger 
ST and Rinn JL. Cell Res, 2014, 48:155-157 
Highlighted by: Nature.com Highlighted by: 
Global Medical Discovery Highlighted by: 
National Science Reviews 
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Invited Talks 


+ 11/2016: RNA Biology, Cold Spring Harbor 


Asia, Suzhou, Jiangsu, China (Selected talk) 


+ 11/2016: 2016 Genetics Conference, 


Genetics Society of China, Wuxi, Jiangsu, 
China (Invited talk and session chair) 


+ 10/2016: 2016 Annual Genetics Conference, 


Genetics Society of Jiangsu, Wuxi, Jiangsu 
China (Invited talk) 


+ 10/2016: NJMS-UH Cancer Center, Rutgers 


University, Newark, NJ, USA (Invited talk) 


+ 10/2016: 7th National Conference on 


Bioinformatics and Systems Biology, 
Chengdu, China (Invited talk) 


+ 09/2016: The 9th National RNA Conference, 


Shanghai, China (Invited talk) 


+ 09/2016: The 3rd Global Conference of 


Chinese Geneticists, Hangzhou, China 
(Invited talk and session chair) 


+ 07/2016: The 11th Chinese Biological 


Investigators Society (CBIS) Biennial 
meeting, Chengdu, China 


+ 07/2016: Sino-German Symposium on 


“RNA biology and human disease: from 
molecular mechanisms to global networks”, 
Rauischholzhausen, Germany (Invited talk) 


+ 01/2016: International Conference on 


Cancer Genomics, Hong Kong, China 
(Invited talk) 


+ 11/2015: Cell Symposia on Human 


Genomics, Singapore (Selected talk) 


+ 10/2015: First International Epigenomics 


Conference, Shanghai, China (Invited talk) 


+ 09/2015: RNAi China 2015, Kunshan, Jiangsu 


Province, China (Invited talk) 


+ 09/2015: School of Life Science, 


ShanghaiTech University, Shanghai, China 
(Invited talk) 


+ 09/2015: Annual meeting of Biochemistry 


and Molecular Biology Society of Zhejiang 
Province, Hangzhou, China (Invited talk) 


+ 04/2015: International Symposium of 
Genetics 2015 Genome, 
Epigenome and Phenome, Shanghai, China 
(Invited talk) 


+ 11/2014: The 16th Tokyo RNA Club meeting, 


RIKEN, Tokyo, Japan (Invited talk) 


+ 11/2014: The 37th Annual meeting of 


the Molecular Biology Society of Japan, 
Yokohama, Japan (Invited talk) 


+ 11/2014: RNA Biology, Cold Spring Harbor 


Asia, Suzhou, Jiangsu, China (Selected talk) 


+ 10/2014: Medical School, Tsinghua 


University, Beijing, China (Invited talk) 


+ 09/2014: School of Life Science, Fudan 


University, Shanghai, China (Invited talk) 


+ 08/2014: Regulatory Non-Coding RNAs, 


Cold Spring Harbor, New York, USA 
(Selected talk) 


+ 07/2014: Toward Cutting-Edge: International 


Advanced Summer School for Biochemistry 
and Molecular Cell Biology, Shanghai, China 
(Invited talk) 


+ 06/2014: 8th IUPAP International 


Conference on Biological Physics, Beijing, 
China (Invited talk and co-organizer for 
Session8: Genome Structure and Non- 
coding RNA) 
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Cooperation 


+ Long noncoding RNAs and regulatory 
RNA editing. Prof. Ling-Ling Chen, 
Institute of Biochemistry and Cell Biology, 
SIBS, CAS, Shanghai, China; Prof. Gordon 
G. Carmichael, Department of Genetics 
& Developmental Biology, University of 
Connecticut Health Center, Farmington, CT, 
USA. 


+ Human embryonic stem cells. Prof. Ling- 
Ling Chen, Institute of Biochemistry and 
Cell Biology, SIBS, CAS, Shanghai, China; 
Prof. Xue-Jun Li, Department of Biomedical 
Sciences, University of Illinois College of 
Medicine at Rockford, IL 61107, USA. 


+ Multipotent mammary stem cell. Prof. 
Yi Zeng, Institute of Biochemistry and Cell 
Biology, SIBS, CAS, Shanghai, China. 


+ Skeletal development. Prof. Weiguo Zou, 
Institute of Biochemistry and Cell Biology, 
SIBS, CAS, Shanghai, China. 


Teaching 


+ Bioinformatics and Algorithms, course for 
1st year graduate students at the Shanghai 
Branch of the CAS Graduate School 
together with Professors from PICB 


+ Summer Courses in Advanced Biology at 
Fudan University with Professors from other 
Chinese universities/institutes 


+ Bioinformatics and Algorithms, course for 
senior undergraduates or junior graduate 
students at ShanghaiTech University 
together with Professors 


External Funding 


+ Active: 
“Regulatory network and molecular 
mechanism of key protein factors in tumor 
metabolism” 
2014CB910600, Ministry of Science and 
Technology of China 
Principal Investigator: Ping Gao (Major 
Investigator: Li Yang) 
01/01/14 — 12/30/18 


“Identification, profiling and function 
analysis of new long noncoding RNAs with 
specific structures” 

31471241, National Natural Science 
Foundation of China 


Principal Investigator: Li Yang 
01/01/15 — 12/30/18 


“Different subcellular localization of long 
noncoding RNAs and their functions” 
91540115, National Natural Science 
Foundation of China 

Principal Investigator: Li Yang 

01/01/16 — 12/30/18 


Lead project on Personalized medicine 
XDA12000000, Chinese Academy of 
Sciences 

Principal Investigator: Jian Ding (Major 
Investigator: Li Yang) 

10/01/15 — 12/30/18 


+ Completed: 
“Whole transcriptome analysis of RNA 
editing and its functional study” 
31271390, National Natural Science 
Foundation of China 
Principal Investigator: Li Yang 
01/01/13 — 12/30/16 
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“Complexity of eukaryotic transcriptomics” 
Hundred Talents Program 

2012OHTPO08, Chinese Academy of Sciences, 
China 

Principal Investigator: Li Yang 

01/01/12 — 12/30/14 


“Functional Study of Human Long 
Noncoding RNA and Its Regulation in 
Disease” 

11PJ1411000, Shanghai Science and 
Technology Commission, China 
Principal Investigator: Li Yang 
10/01/11 — 09/30/13 


“Development of high-throughput 
sequencing strategy with low amount of 
sample inputs” 

YG2012101, Chinese Academy of Sciences, 
China 

Principal Investigator: Li Yang 

01/01/13 — 12/31/13 
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5 Research Support And Central Units 


5.1 Information technology facility 


Staff: 


Guoqing Zhang, Head 
Phone: +86-21-5492 0465 
Email: gqzhang@picb.ac.cn 


Liangxiao Ma, Senior System Administrator 
Qirui Chen, Technician 


Overview 


The computer system at PICB serves as 

a collaborative platform for scientific 

research as well as for regular tasks such as 
communication and document preparation. 
To provide all the required services in a 
seamless way, we run a networked UNIX- 
based platform, with unified user accounts 
for desktops, servers and a high-performance 
cluster. 


Current State of Resources 
and Services 


High-performance cluster 


Over the past three years, some energy-intensive 
IBM nodes were removed from the cluster and 
new compute servers were added. The current 
cluster has 3742 CPU cores in 219 IBM, HP and 
Huawei nodes. More than 1000 cores are located 
in 48-core AMD Opteron nodes and 2700 cores in 
64- or 128-core Intel Xeon nodes, allowing for 
large-scale parallel computations using shared 
memory of up to 2 TB. Up to 45% of the cluster is 
used interactively, while the main load is managed 
through the batch job scheduling engine SGE. 
Networked storage is accessed over 10 Gb/s 
Ethernet through automated NFS mounting, 
using identical paths throughout PICB. For 
maintenance tasks, such as remote lights-out 
management and out-of-band login, the IPMI 
protocol is used without separate cabling. 


Storage and Backup 


The networked storage capacity (NAS) totals 
2.3 PB: 200 TB on six classical UNIX (OS X/Linux) 
NFS servers, 860 TB on a dedicated commercial 
scale-out NAS cluster by EMC/Isilon, installed in 
2011, and 1260 TB on a dedicated commercial 
scale-out NAS cluster by Huawei (OceanStor 9000), 
installed in 2017. In total, it provides an available 
capacity of 1900 TB in fourteen SATA-based nodes 
and 30 TB in three SATA/SSD-based nodes. 
Volume-oriented workloads and latency-critical 
workloads are distributed automatically to the 
best-suited node types. Compared to classical 
network storage servers, the Isilon and Huawei 
OceanStor systems have, by their design, many 
proven benefits: high stability, no downtimes 
during maintenance or extensions, user-definable 
snapshots, high availability based on erasure 
coding, balanced distribution of capacity and 
throughput across the system, easy capacity 
planning and provisioning, as well as good 
monitoring and reporting of usage and 
performance figures. 


Figure 1. To meet the growing demand, the EMC/Isilon system was expanded to 600 TB of available capacity in 
2015, and a Huawei OceanStor 9000 (6 nodes) was purchased in early 2017, giving a total available capacity of 
1600 TB. 
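
The quoted NAS total can be checked with simple arithmetic. The short sketch below sums the three capacity figures given in the text; it assumes decimal units (1 PB = 1000 TB), which the report does not state, and the dictionary keys are labels chosen here for illustration only.

```python
# Raw NAS capacities in TB, as quoted in the storage description.
nas_components_tb = {
    "classical NFS servers": 200,
    "EMC/Isilon scale-out NAS": 860,
    "Huawei OceanStor 9000 scale-out NAS": 1260,
}

# Sum the components and convert to PB.
# Assumption: decimal units, 1 PB = 1000 TB.
total_tb = sum(nas_components_tb.values())
total_pb = total_tb / 1000

print(f"Total NAS capacity: {total_tb} TB = {total_pb:.1f} PB")
```

Under this assumption the three components add up to 2320 TB, which rounds to the 2.3 PB stated above.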


To protect against data loss in the event of 
a disaster, an IBM TS3500 tape library with 
3 PB (compressed) capacity is operated in 
another building on campus, based on 
LTO6 technology and connected via 10 Gb/s 
Ethernet. Nightly backups are automatically 
scheduled and performed by Symantec 
NetBackup. Data restores (excluding tests) are 
almost never performed. This is due to the 
high stability of the storage and to the ability 
to restore accidentally deleted files directly 
from on-disk snapshots on the storage. 


Network and server rooms 


The core network infrastructure is managed 
by the SIBS network center through a DCN 
(Digital China Networks) router based on 
1 Gb/s Ethernet, providing three major subnets 
to PICB: one for IT-maintained machines 
including the HPC cluster, one for 
user-maintained machines and one for WiFi access. 
Each machine, including desktop computers, 
has a dedicated 1 Gb/s Ethernet port. Within 
PICB, traffic is managed by a 10 Gb/s 
non-routing DCN core switch and about thirty 
end-level switches (Cisco and DCN). All uplinks to 
the HPC cluster switches and about one third 
of the floor switches now use 10 Gb/s 
connections. 


PICB runs three major server rooms: two 
office-sized rooms on the 3rd floor and a 
double-sized room on the 1st floor. Holding 23 
racks in total, the infrastructure is protected 
by uninterruptible power supplies with a 
total of more than 150 kW and five “precision” 
air-conditioning units. 
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Desktops 


The IT group manages more than 150 desktop 
computers running Apple OS X or Ubuntu 
Linux. Users experience an integrated 
research environment using the same accounts, 
storage and software paths as on the 
cluster. In addition, we provide and control 
networks for 150 laptops and Windows 
PCs. Each office is equipped with a 
network-accessible laser printer. 


Services and Software 


While the infrastructure operates the basic 
network services (DNS, DHCP, NTP), user 
accounts and storage are provided through a 
central directory service (LDAP) and network 
file systems (automounted NFS). The core 
functions are continuously monitored to 
detect and avoid bottlenecks and outages. In 
particular, file server traffic is analyzed by user, 
client and file system location, which often 
helps in advising users how to optimize the 
software they are running on the cluster. 


On top of these basic services, there are 
generic application services such as office and 
scientific document processing (Microsoft 
Office for Mac, LaTeX), e-mail (Cyrus/Postfix 
for POP/IMAP/SMTP, SquirrelMail as webmail 
client, Mailman for mailing lists), printing 
(CUPS), web servers (Apache, Tomcat, 
MediaWiki), databases (PostgreSQL, MySQL), 
storage backup (BakBone NetVault), and job 
scheduling (SGE grid engine). A variety of 
specific software programs for bioinformatics 
(NCBI suite including BLAST, EMBOSS, 
BioConductor), mathematics (R, Maple, Matlab, 
CPLEX), biochemistry (including Gromacs) and 
software development (GCC suite, Eclipse, 
Netbeans), both open-source and commercial, 
are available via network file services 
throughout PICB. In total, there are 600 major 
and minor packages installed for Linux and 
450 for Apple, plus more than 600 R packages. 


Several new web services have been online 
since 2016: the web portal of the Bio-Med 
Big Data Center and the National Omics Data 
Encyclopedia (NODE) are available at 
http://www.biosino.org. New databases 
and technologies were adopted, such as 
MongoDB and SOLR/Lucene. 


The IT group assists in a) efficiently using 
the network storage, the cluster and any 
software, b) setting up researchers’ own 
web-based services, and c) connecting 
self-maintained laptops/PCs to the basic network 
services (Internet, e-mail, printing). Further 
internal services developed and operated by the 
IT group include a network equipment and 
asset management system and a content 
management system for the PICB website, 
used by non-technical staff in the PICB 
administration. 


Development and Future 
Perspective 


Development 


With the fast-growing amounts of biomedical 
big data, the current storage capacity is 
becoming a bottleneck for PICB. Even though 
1 PB of Huawei OceanStor 9000 storage was 
added, no more than 200 TB remained unused 
after 4 months. As the budget for the OceanStor 
9000 storage came from the Chinese 
Government's “Equipment Purchase & Renew” 
program (2016-2018), PICB needs a new budget to 
extend the storage. 


Nevertheless, the IT group keeps evaluating 
new storage systems which might 
complement the infrastructure for specific 
workloads. In particular, the free-of-charge 
“Fraunhofer-Gesellschaft Filesystem” (FhGFS), 
which runs on inexpensive standard hardware, 
seems promising in terms of capacity, 
performance, low complexity and price, at the 
cost of lower reliability. 


The HPC cluster has grown slowly since 2014, 
as the shared resources had already been 
expanded substantially in 2014. In total, two new 
nodes (each with 48 cores and 2048 GB RAM) have 
been added by PICB, and some fat nodes (more 
than 128 GB of memory) were added by individual 
research groups. There has been no noteworthy 
improvement in hardware-accelerated 
computation. GPU servers were used by two 
or three PI teams but were not integrated into 
the HPC cluster. 


Although Bio-Med Big Data is the new 
development direction, web servers are severely 
underfunded. During 2014-2017, PICB invested 
many IT resources in HPC, some in storage, 
and few in web servers. However, big data 
sharing services depend on web servers and 
database servers. These requirements are 
under consideration. 


Future Perspective 


We have laid a strong foundation for further 
demands on storage and computational power. 
Distributed storage, such as the EMC/Isilon 
cluster and the Huawei OceanStor 9000, 
will be extended to 20 PB. At the same time, 
more super fat nodes (48 CPU cores, >2 TB 
memory) will be added to the HPC cluster. 


A cloud computing environment based on 
open-source software will be deployed in the next 
3 years. Docker will be widely used for web 
servers, database servers, application servers 
and algorithm servers. Accordingly, Docker 
clusters based on commercial or non-commercial 
solutions, such as Docker Swarm and Consul, 
will be tested. If the test results meet the 
requirements, the old job scheduling software 
(SGE) will be deprecated and a new-generation 
job scheduler will be brought online. 


In addition, a deep learning and AI 
environment will be deployed on a trial basis. 
Some GPU servers with Tesla P100 chips, FPGA 
chips and the TPU service from Google will be 
tested and evaluated. 


The entire equipment budget will come from 
Shanghai special funds, Guizhou special 
funds, the Chinese Government's “Equipment 
Purchase & Renew” program and research 
programs; the total plan will be up to 40 million 
RMB. In order to build a national biomedical 
big data infrastructure, the IT group will be 
reorganized and renamed the high performance 
storage and computing platform. Two new 
platforms, a bioinformatic analysis platform 
and a biomedical database R&D platform, are 
being built and members are being recruited. 
The team will grow to at least 12 members. 
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General Information 
Cooperation 


+ SIBS Network Center, Director Miao, Mr. 
Qiming Sun: Network design and implementation, 
operation of Cisco and DCN network equipment, 
Internet access 


+ Center for Biotechnology, Bielefeld University, 
Dr. Alexander Goesmann: High-performance 
computing cluster design and operation 


+ Heidelberg Institute for Theoretical Studies, 
Johannes Wagner: High-throughput distributed 
storage systems 


+ Beijing Genome Institute (BGI), Shenzhen, 
Dr. Fang Lin: Storage and computing infrastructure 
design for large scale genomic projects 
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5.2 Uli Schwarz Quantitative Biology Core Facility 


Staff: 


Dr. Yujie Chen, Head 


Phone: +86-21-5492 0415 
Email: chenyujie@picb.ac.cn 


Nianhao Cheng, Assistant 


Overview 


The public central Lab of PICB was planned 
in spring 2008 and finally launched on 16th 
October 2008. To commemorate Prof. Dr. Uli 
Schwarz and his great contribution to Sino- 
German scientific collaboration, the public 
central lab of PICB, which was built upon the 
former Max-Planck Guest Lab founded by 
him in 1985, was renamed as the Uli Schwarz 
Laboratory. Its essential role is to provide 

an integrated public wet lab infrastructure 
that fulfils the common demand of wet lab 
experiments in PICB. With the establishment 
of the “Big data center” of PICB, the Uli 
Schwarz Laboratory now belongs to the 
“Big data center” and has been renamed the Uli 
Schwarz Quantitative Biology (Quality Control) 
Core Facility, which also supplies standardized 
tests and data quality control for the “Big data 
center”. 


The Uli Schwarz Quantitative Biology Core 
Facility has a total space of about 260 m² 
and is located on the 2nd and 5th floors of 
the main building. Over the last three years, it 
developed from a conventional experimental 
lab into a multi-function core facility, which 
consists of a toxic reagents room, DNA room, 
RNA room, PCR room, public cell culture room, 
plant tissue culture room (Figure 1), a “High- 
throughput quantitative analysis platform” 
and a “High-throughput imaging and analysis 
platform” (Figure 2). 


The Uli Schwarz Quantitative Biology Core 
Facility has enhanced its internal management 
in the past three years in the following ways: 


1) The regulation of the Uli Schwarz 
Quantitative Biology Core Facility is now 
formally installed and enforced; only trained 
and certified users can gain access to the 
core facility. 


2) A price policy and a public booking and 
card system were established. Daily use 
of the instruments is kept in good order. 
The frequency of instrument usage can be 
easily tracked and examined through the 
card system. The Uli Schwarz Quantitative 
Biology Core Facility is open not only to 
PICB but also to users across China. So far, 
it has offered sharing services to the 
Chinese Academy of Sciences, hospitals, 
universities and biotech firms. Part of the daily 
operation and instrument repair costs of 
the Uli Schwarz Quantitative Biology Core 
Facility can be covered by a fee levied on 
instrument usage. 


Figure 1. A, B, The public experimental lab. C, D, The public cell culture room. 


3) Internal training courses are provided on 
a frequent basis. A full range of internal 
technical support (covering all of the 
existing instruments in the Uli Schwarz 
Quantitative Biology Core Facility) is 
available. This core facility also supported 
science outreach activities and hosted 
a scientific summer camp for middle school 
students. 


4) A website for the core facility has been 
established, which briefly introduces its 
main functions and instruments. 


5) A standard experimental record book 
for all groups was created for the better 
management of scientific documents. 


The Uli Schwarz Quantitative Biology Core 
facility is well equipped with all essential 
instruments for daily experiments. Three years 
ago, the major instruments included: 


+ NanoDrop ND-1000 for detecting trace 
samples. 


+ Covaris S220: an instrument to disrupt 
tissue samples and shear nucleic acid 
samples homogeneously. 


+ Agilent 2100 Bio-analyzer for DNA, RNA and 


protein analyses using high-throughput 
microchip capillary electrophoresis. 


+ Roche LightCycler 480 for real-time PCR 


analysis. 


+ Leica Dissecting Microscope for tissue 


dissecting and micro injection. 


+ Zeiss Observer Z1: an inverted phase- 


contrast fluorescent microscope with wide 
applications. 


+ Bio-Tek Synergy H1 microplate reader: a 


multifunctional microplate reader system 
for fluorescence and absorption light. 


+ ABI Laser Capture Micro dissection (LCM) 


system: for the micro dissection of a single 
cell, a specific group of cells, fluorescent 
cells or specific tissue. 


+ Eppendorf epMotion 5070: an efficient 


automated pipetting system platform 
designed for cell culture applications and 
higher bio-security requirements. 
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+ Cellometer automated cell counter: for 


quick counting of living and dead cells. 


+ BEX CUY21 electroporator for 


electroporation of different tissues, 
embryos and cells. 


Over the last three years, the Uli Schwarz 
Quantitative Biology Core facility has further 
extended its functionalities as listed below: 


1) Built up a “High-throughput quantitative 
analysis platform”, newly equipped with: 


+ Enzymatic system (PE JANUS work station+ 


microplate reader) for enzymatic activity 
analysis. 


+ Bio-Mark HD high-throughput gene 


analysis system. 


+ Microfluidic chip technology system 
for research on aging: includes a PDMS 
microfluidic chip module, an image 
acquisition and analysis module, and an 
automation control system integration 
module. 


2) Built up a “High-throughput imaging and 
analysis platform”, newly equipped with: 


+ VisioSPHERE 3D-imaging system for precise 


3D face imaging and analysis. 


+ Zeiss LSM880 two-photon confocal 


microscope for normal or thick tissue slice, 
cells and living animal imaging. 


+ Zeiss Lightsheet Z.1 fluorescent microscope 
for living zebrafish, Drosophila embryo, 
CUBIC/CLARITY-treated mouse embryo, 
tissue and whole brain imaging. The Uli 
Schwarz Quantitative Biology Core Facility 
has mastered the CUBIC method for 
mouse embryo, tissue and whole brain, and 
can provide good support for transparent 
sample preparation and imaging. 


The total usage time of the large shared 
instruments also grew over the years (Figure 
3). 


Figure 2. A, High-throughput quantitative analysis platform, Enzymatic system (PE JANUS work station + 
microplate reader). B, High-throughput imaging and analysis platform, Zeiss Lightsheet Z.1 fluorescent 
microscope (left) and Zeiss Observer Z1 (right). C, Bio-Mark HD high-throughput gene analysis system. D, Zeiss 
LSM880 two-photon confocal microscope. E, VisioSPHERE 3D-imaging system. 


Figure 3. The total usage time (h) of large shared 
instruments in 2014, 2015 and 2016. 


In 2016, the Uli Schwarz Quantitative Biology 
Core Facility started to be acknowledged 
in publications: 


+ Gu J, Wang D, Zhang J, Zhu Y, Li Y, Chen 
H, Shi M, Wang X, Shen B, Deng X, Zhan 
Q, Wei G, Peng C. (2016). GFRα2 prompts 
cell growth and chemoresistance through 
down-regulating tumor suppressor gene 
PTEN via Mir-17-5p in pancreatic cancer. 
Cancer Lett. 2016 Oct 1;380(2):434-41. 


+ Xie J, Zhu Y, Chen H, Shi M, Gu J, Zhang J, 
Shen B, Deng X, Zhan X, Peng C. (2016). The 
Immunohistochemical Evaluation of Solid 
Pseudopapillary Tumors of the Pancreas 
and Pancreatic Neuroendocrine Tumors 
Reveals ERO1Lβ as a New Biomarker. Medicine 
(Baltimore). 2016 Jan;95(2):e2509. 
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5.3 Omics Core Group 


Staff: 


Changpeng Xin, Head 
Phone: +86-21-5492 0545 
Email: xinchangpeng@picb.ac.cn 


Hua Feng, Technician 
Han Yan, Technician 
Mingshuang Li, Technician 


Overview 


PICB Omics Core was established in 2012 as a 
service platform, led by Dr. Li Yang and 
Changpeng Xin, and has been enhanced 
and supported by PICB administration, IT 
and the Uli Schwarz Central Laboratory. Omics 
Core provides both library preparation and 
subsequent high-throughput deep sequencing 
services. Bioinformatics data analyses such 
as RNA-Seq analysis, ChIP-seq analysis, etc. 
are also supported. Omics Core aims to provide 
high-quality next generation sequencing 
and bioinformatics services not only to 
researchers at PICB but also to those from other 
institutions. 


Current State of Resources 
and Services 


Instruments 


The Core is currently equipped with one 
Illumina NextSeq 500, an Illumina HiSeq 2000, 
cBot, Tru Temp Microheating System, High 
Speed Micro Plate Shaker, centrifuges, Qubit, 
incubator, micro oven and other auxiliary 
instruments (Figure 1). 


Flow of Service 


PICB Omics Core provides 1x50, 1x75, 1x100, 
2x75 and 2x150 sequencing runs. Standard 
workflows are available for genome, 
metagenome, transcriptome, exome, 
small RNA, chromatin immunoprecipitation 
products, DNA methylation, etc. Each workflow 
includes library preparation, sequencing and 
bioinformatics data analysis. In general, it takes 
about 4 - 8 weeks to finish a given contract if 
the samples pass the standard quality control 
analysis. We also provide a rapid service mode, 
which can finish a contract (with a limited 
number of samples) in 1 week (Figure 2). Please 
visit our website at www.picb.ac.cn/hiseq for 
details and our contact information. 


Current Service Statistical 
Analyses 


From 2014 to 2017, Omics Core prepared 
deep sequencing libraries from 235 DNA, 
1997 mRNA, 123 small RNA and 831 ChIP 
samples for different individual groups 
(Figure 3). Most PICB samples are derived 
from DNA, RNA, small RNA and ChIP, while SIBS 
samples are predominantly RNA and ChIP. 
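
As a minimal sketch, the library counts quoted above can be turned into a total and per-category shares. The counts are taken from the text; the variable names are illustrative, and the shares describe library preparation only, which is not necessarily the same breakdown as the sequencing statistics shown in the figures.

```python
# Library preparation counts for 2014-2017, as quoted in the text.
library_counts = {
    "DNA": 235,
    "mRNA": 1997,
    "small RNA": 123,
    "ChIP": 831,
}

# Total number of libraries and the share of each sample category.
total_libraries = sum(library_counts.values())
shares = {name: count / total_libraries for name, count in library_counts.items()}

print(f"Total libraries: {total_libraries}")
for name, share in shares.items():
    print(f"{name:>9}: {share:.1%}")
```

The four categories add up to 3186 libraries, of which mRNA libraries form the largest group.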
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Figure 1. Snapshots of illumina NextSeq 500, HiSeq 2000 equipment and other auxiliary instruments 
at PICB Omics Core. 

Top left: HiSeq 2000 sequencing machine; Top right: NextSeq 500 sequencing machine; 

Bottom left 1: Tru Temp Microheating System & High Speed Micro Plate Shaker; 

Bottom left 2: cBot for cluster generation; Bottom left 3: bench for centrifuges, Qubit, incubator, and micro oven; 
Bottom left 4: room for library preparation. 


Flow of service                     Duration          Contacts 

Pre-inquiry and consulting          Days to months    Li Yang, 
  By emails                                           Changpeng Xin, et al. 
  By phone calls 

Sign a contract                     Days to months    Experimental part: 
  + Provide sample information                        MingShuang Li, Han Yan 
  + Sample quality control                            Bioinformatics part: 
                                                      Changpeng Xin, Hua Feng 

Service                             Days to 1 month 
  Library preparation 
  Sequencing 
  Bioinformatics analysis 

Fulfill the contract                1 week 
  Quality control 
  Data delivery and run report 


Figure 2. Flow of deep sequencing service at Omics Core. Each step of service, duration and contacts 
are listed. 
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Figure 3. Overview of library preparation service at Omics Core by sample categories and their 
institutional affiliations. 
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Figure 4. Overview of deep sequencing service at Omics Core by sample categories. 
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Figure 5. Overview of deep sequencing service at Omics Core by month (2014-2017). 
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Figure 6. A good RNA sample examined by Agilent Bioanalyzer 2100. 
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Figure 7. A bad RNA sample examined by Agilent Bioanalyzer 2100. 


Our statistical analysis reveals that RNA 
samples are the most frequently sequenced 
(Figure 4, about 62%). In total, more than 17.18 
Tera bases of high-quality deep sequencing 
data have been generated for different groups 
at PICB, SIBS and other institutes (Figure 5). 


Development and Future 
Perspective 


Library preparation: 


Generally speaking, the original sample quality 
is crucial for successful library preparation. 
At PICB Omics Core, we have set up a 
standard quality control analysis pipeline for 
all the samples we receive, unless additional 
agreements are made for special samples. We 
have noticed that this step is the bottleneck of 
our contract service: it can take months 
for customers to prepare new samples of 
sufficient quality to reach our standards. 
Please visit our website at www.picb.ac.cn/hiseq 
for details and consult with us directly. 


As suggested, Qubit is used to examine 
the quantity and Agilent Bioanalyzer 2100 
is used to assess the quality of samples. 
For instance, the RIN value from the Bioanalyzer 
2100 measurement should be higher than 8 for 
good RNA samples, as shown in Figure 6. 
RNA samples with RIN<8 or an unexpected 
signal in the 5S region (as shown in Figure 7) 
have been proven unsuitable for further 
processing. We suggest that customers repeat 
sample collection or purification, or that 
additional agreements be signed for further 
processing. Quantity and quality standards 
for all kinds of samples are available 
on our website at http://www.picb.ac.cn/ 


Sequencing run: 


A quality score (or Q-score) is a prediction 
of the probability of an incorrect base call. 
A higher quality score indicates a smaller 
probability that a base is called incorrectly. 
For Illumina HiSeq 2000, paired 100 bp runs 
typically vary in the range of 80 to 90% of 
bases above Q30 and paired 50 bp runs 
typically vary in the range of 85 to 95% of bases 
above Q30. For Illumina NextSeq 500, paired 
150 bp runs typically vary in the range of 85 
to 90% of bases above Q30 and paired 75 bp 
runs typically vary in the range of 85 to 93% of 
bases above Q30. For a paired 75 bp run at 
PICB Omics Core (Figure 8), about 93.6% of reads 
were above Q30 (left panel), and even 84% of 
reads reached Q40. 
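
The relation between a Q-score and the error probability it predicts follows the standard Phred definition, Q = -10 log10(P), so Q30 corresponds to an expected error rate of 1 in 1000 base calls. The sketch below illustrates the conversion and how a "fraction of bases at or above Q30" figure could be computed; the function names and the example quality list are illustrative and are not part of the Omics Core pipeline.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q into the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) into a Phred quality score."""
    return -10 * math.log10(p)


def fraction_at_or_above(qualities, threshold=30):
    """Fraction of base calls whose quality score is at least `threshold`."""
    if not qualities:
        raise ValueError("qualities must not be empty")
    return sum(1 for q in qualities if q >= threshold) / len(qualities)


# Hypothetical per-base quality scores, for illustration only.
example_qualities = [40, 35, 30, 20]

for q in (20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
print(f"Fraction >= Q30: {fraction_at_or_above(example_qualities):.0%}")
```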


Figure 8. High quality data generated at Omics Core indicated by Q30. 


Omics Core also applies FastQC for further 
quality control analysis (Figure 9). This offers a 
quick impression of a dataset along the reads. 


Figure 9. FastQC results of one sample that Omics Core has run (panels: Read 1 and Read 2). 
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In addition, the technical replicates at PICB 
Omics core are correlated very well (Figure 10), 
indicating very little systematic error/bias 
introduced by operations and/or the machine. 
As shown in Figure 10, the correlation is above 
0.995 in all four comparisons (data courtesy of the 
Group of Eukaryotic Transcriptomes at PICB). 


Figure 10. High quality sequencing data with high correlations between technical replicates 
(provided by Group of Eukaryotic Transcriptomes at PICB, unpublished data). Panels: Sample 1 and 
Sample 2, each plotting Technical replicate 1 against Technical replicate 2; correlation values 
printed in the panels: 0.996 and 0.998. 
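
A correlation between two technical replicates can be computed as sketched below. The report does not say which correlation measure was used; this example assumes the Pearson coefficient, and the expression values are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical expression values of the same genes in two technical replicates.
replicate_1 = [12.0, 250.0, 33.5, 1800.0, 0.8, 96.0]
replicate_2 = [11.5, 262.0, 31.0, 1750.0, 1.1, 101.0]

print(f"Pearson r = {pearson_r(replicate_1, replicate_2):.4f}")
```

A value close to 1 means the two replicates rank and scale the measured features almost identically, which is the property the replicate comparison is meant to demonstrate.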


Bioinformatics service 


Since 2016, Omics Core has provided 
bioinformatics services to scientists in other 
institutes. By the end of May, we had 
provided RNA-Seq analysis, ChIP-Seq analysis, 
Single Cell RNA-Seq analysis, RNA alternative 
splicing analysis, DNA methylation analysis, 
and human whole exome sequencing analysis 
services to different groups in SIBS, CAS, 
universities and hospitals. These services can 
effectively guide their follow-up experiments. 


Summary: 


Omics Core aims at promoting the innovation
and development of PICB in the field of
computational biology by offering high-quality
first-hand sequencing data. At the same time,
we are providing a platform for sequencing
technical exchanges and technical services to
other research communities. In the last three
years, PICB Omics Core has


1. fulfilled about 9.6 million CNY of service
contracts;


2. generated more than 17.18 Tera of high
quality sequencing data, mostly for PICB
customers;


3. been acknowledged in publications and
submitted/prepared manuscripts;


4. provided bioinformatics service for
experimental scientists.


General information


Publications using data produced by
Omics Core (2014 - 2017)


• Cao C, Xu J, Zheng G, Zhu X-G (2016) Evidence for the
role of transposons in the recruitment of cis-regulatory
motifs during the evolution of C4 photosynthesis. BMC
Genomics. doi: 10.1186/s12864-016-2519-3


• Chen T, Xiang J-F, Zhu S, Chen S, Yin Q-F, Zhang X-O,
Zhang J, Feng H, Dong R, Li X-J, et al (2015) ADAR1 is
required for differentiation and neural induction by
regulating microRNA processing in a catalytically
independent manner. Cell Res, 25, 459-476


• Dong R, Ma X-K, Chen LL, Yang L (2016a) Increased
complexity of circRNA expression during species
evolution. RNA Biol, 1-11


• Dong R, Zhang X-O, Zhang Y, Ma X-K, Chen L-L, Yang L
(2016b) CircRNA-derived pseudogenes. Cell Res, 26,
747-750


• Green CD, Huang Y, Dou X, Yang L, Liu Y, Han J-DJ
(2017) Impact of Dietary Interventions on Noncoding RNA
Networks and mRNAs Encoding Chromatin-Related Factors.
Cell Rep, 18, 2957-2968


• He Z, Bammann H, Han D, Xie G, Khaitovich P (2014)
Conserved expression of lincRNA during human and
macaque prefrontal cortex development and maturation.
RNA, 20, 1103-1111


• He Z, Han D, Efimova O, Guijarro P, Yu Q, Oleksiak A,
Jiang S, Anokhin K, Velichkovsky B, Grunewald S, et al
(2017) Comprehensive transcriptome analysis of
neocortical layers in humans, chimpanzees and macaques.
Nat Neurosci, 20, 886-895


• Hu S-B, Xiang J-F, Li X, Xu Y, Xue W, Huang M, Wong CC,
Sagum CA, Bedford MT, Yang L, et al (2015) Protein
arginine methyltransferase CARM1 attenuates the
paraspeckle-mediated nuclear retention of mRNAs
containing IRAlus. Genes Dev, 29, 630-645


• Li Q, Guo S, Jiang X, Bryk J, Naumann R, Enard W,
Tomita M, Sugimoto M, Khaitovich P, Pääbo S (2016) Mice
carrying a human GLUD2 gene recapitulate aspects of
human transcriptome and metabolome development. Proc
Natl Acad Sci U S A, 113, 5358-5363


• Li X, Liu C-X, Xue W, Zhang Y, Jiang S, Yin Q-F, Wei J,
Yao R-W, Yang L, Chen L-L (2017) Coordinated circRNA
Biogenesis and Function with NF90/NF110 in Viral
Infection. Mol Cell. doi: 10.1016/j.molcel.2017.05.023


• Li Y, Li G, Görling B, Luy B, Du J, Yan J (2015)
Integrative Analysis of Circadian Transcriptome and
Metabolic Network Reveals the Role of De Novo Purine
Synthesis in Circadian Control of Cell Cycle. PLoS
Comput Biol, 11, e1004086


• Liu X, Han D, Somel M, Jiang X, Hu H, Guijarro P,
Zhang N, Mitchell A, Halene T, Ely JJ, et al (2016)
Disruption of an Evolutionarily Novel Synaptic
Expression Pattern in Autism. PLoS Biol, 14, e1002558


• Lyu M-JA, Gowik U, Kelly S, Covshoff S, Mallmann J,
Westhoff P, Hibberd JM, Stata M, Sage RF, Lu H, et al
(2015) RNA-Seq based phylogeny recapitulates previous
phylogeny of the genus Flaveria (Asteraceae) with some
modifications. BMC Evol Biol, 15, 116


• Niu N, Xiang J-F, Yang Q, Wang L, Wei Z, Chen L-L,
Yang L, Zou W (2017) RNA-binding protein SAMD4
regulates skeleton development through translational
inhibition of Mig6 expression. Cell Discov, 3, 16050


• Peng G, Suo S, Chen J, Chen W, Liu C, Yu F, Wang R,
Chen S, Sun N, Cui G, et al (2017) Spatial Transcriptome
for the Molecular Annotation of Lineage Fates and Cell
Identity in Mid-gastrula Mouse Embryo. Dev Cell, 36,
681-697


• Qi Y, Yu J, Han W, Fan X, Qian H, Wei H, Tsai YS,
Zhao J, Zhang W, Liu Q, et al (2016) A splicing isoform
of TEAD4 attenuates the Hippo-YAP signalling to inhibit
tumour proliferation. Nat Commun, 7, ncomms11840


• Su M, Han D, Boyd-Kirkup J, Yu X, Han J-DJ (2014a)
Evolution of Alu Elements toward Enhancers. Cell Rep, 7,
376-385


• Su Z-D, Sheng Q-H, Li Q-R, Chi H, Jiang X, Yan Z, Fu N,
He S-M, Khaitovich P, Wu J-R, et al (2014b) De novo
identification and quantification of single amino-acid
variants in human brain. J Mol Cell Biol, 6, 421-433


• Tao Y, Lyu M-JA, Zhu X-G (2016) Transcriptome
comparisons shed light on the pre-condition and
potential barrier for C4 photosynthesis evolution in
eudicots. Plant Mol Biol, 91, 193-209


• Wang D, Cai C, Dong X, Yu QC, Zhang X-O, Yang L,
Zeng YA (2015) Identification of multipotent mammary
stem cells by protein C receptor expression. Nature,
517, 81-84


• Wang D, Hou L, Nakamura S, Su M, Li F, Chen W, Yan Y,
Green CD, Chen D, Zhang H, et al (2017a) LIN-28 balances
longevity and germline stem cell number in
Caenorhabditis elegans through let-7/AKT/DAF-16 axis.
Aging Cell, 16, 113-124


• Wang X, Liu D, He D, Suo S, Xia X, He X, Han J-DJ,
Zheng P (2017b) Transcriptome analyses of rhesus monkey
preimplantation embryos reveal a reduced capacity for
DNA double-strand break repair in primate oocytes and
early embryos. Genome Res, 27, 567-579


• Wei Y-N, Hu H-Y, Xie G-C, Fu N, Ning ZB, Zeng R,
Khaitovich P (2015) Transcript and protein expression
decoupling reveals RNA binding proteins and miRNAs as
potential modulators of human aging. Genome Biol, 16, 41


• Wu H, Yin Q-F, Luo Z, Yao R-W, Zheng GC, Zhang J,
Xiang J-F, Yang L, Chen L-L (2017) Unusual Processing
Generates SPA LncRNAs that Sequester Multiple RNA
Binding Proteins. Mol Cell, 64, 534-548


• Xiang J-F, Yang L, Chen L-L (2015) The long noncoding
RNA regulation at the MYC locus. Curr Opin Genet Dev,
33, 41-48


• Xiang J-F, Yin Q-F, Chen T, Zhang Y, Zhang X-O, Wu Z,
Zhang S, Wang H-B, Ge J, Lu X, et al (2014) Human
colorectal cancer-specific CCAT1-L lncRNA regulates
long-range chromatin interactions at the MYC locus.
Cell Res, 24, 513-531


• Xing Y-H, Yao R-W, Zhang Y, Guo CJ, Jiang S, Xu G,
Dong R, Yang L, Chen L-L (2017) SLERT Regulates DDX21
Rings Associated with Pol I Transcription. Cell, 169,
664-678.e16


• Xu J, Bräutigam A, Weber APM, Zhu X-G (2016) Systems
analysis of cis-regulatory motifs in C(4) photosynthesis
genes using maize and rice leaf transcriptomic data
during a process of de-etiolation. J Exp Bot, 67,
5105-5117


• Yang Y, Fan X, Mao M, Song X, Wu P, Zhang Y, Jin Y,
Yang Y, Chen L-L, Wang Y, et al (2017) Extensive
translation of circular RNAs driven by
N6-methyladenosine. Cell Res, 27, 626-641


• Yin Q-F, Hu S-B, Xu Y-F, Yang L, Carmichael GG,
Chen L-L (2015) SnoVectors for nuclear expression of
RNA. Nucleic Acids Res, 43, e5-e5


• Yuan F, Lyu M-J, Leng B-Y, Zheng G-Y, Feng ZT, Li P-H,
Zhu X-G, Wang B-S (2015) Comparative transcriptome
analysis of developmental stages of the Limonium
bicolor leaf generates insights into salt gland
differentiation. Plant Cell Environ, 38, 1637-1657


• Zhang B, Han D, Korostelev Y, Yan Z, Shao N,
Khrameeva E, Velichkovsky BM, Chen Y-PP, Gelfand MS,
Khaitovich P (2016a) Changes in snoRNA and snRNA
Abundance in the Human, Chimpanzee, Macaque, and Mouse
Brain. Genome Biol Evol, 8, 840-850


• Zhang X-O, Dong R, Zhang Y, Zhang J-L, Luo Z, Zhang J,
Chen L-L, Yang L (2016b) Diverse alternative
back-splicing and alternative splicing landscape of
circular RNAs. Genome Res, 26, 1277-1287


• Zhang X-O, Wang H-B, Zhang Y, Lu X, Chen L-L, Yang L
(2017a) Complementary Sequence-Mediated Exon
Circularization. Cell, 159, 134-147


• Zhang X-O, Yin Q-F, Chen LL, Yang L (2014a) Gene
expression profiling of non-polyadenylated RNA-seq
across species. Genomics Data, 2, 237-241


• Zhang X-O, Yin Q-F, Wang H-B, Zhang Y, Chen T, Zheng F,
Lu X, Chen LL, Yang L (2014b) Species-specific
alternative splicing leads to unique expression of
sno-lncRNAs. BMC Genomics, 15, 287


• Zhang Y-Y, Li C, Yao G-F, Du LJ, Liu Y, Zheng XJ,
Yan S, Sun J-Y, Liu Y, Liu M-Z, et al (2017b) Deletion
of Macrophage Mineralocorticoid Receptor Protects
Hepatic Steatosis and Insulin Resistance Through
ERalpha/HGF/Met Pathway. Diabetes, 66, 1535-1547


• Zhang Y, Xue W, Li X, Zhang J, Chen S, Zhang J-L,
Yang L, Chen LL (2017c) The Biogenesis of Nascent
Circular RNAs. Cell Rep, 15, 611-624


• Zhang Y, Zhang X-O, Chen T, Xiang J-F, Yin Q-F,
Xing Y-H, Zhu S, Yang L, Chen L-L (2017d) Circular
Intronic Long Noncoding RNAs. Mol Cell, 51, 792-806


• Zhao G, Guo S, Somel M, Khaitovich P (2014) Evolution
of human longevity uncoupled from caloric restriction
mechanisms. PLoS One, 9, e84117


• Zhao H, Han Z, Liu X, Gu J, Tang F, Wei G, Jin Y (2017)
The chromatin remodeler Chd4 maintains embryonic stem
cell identity by controlling pluripotency- and
differentiation-associated genes. J Biol Chem, 292,
8507-8519


• Zhong C, Xie Z, Yin Q, Dong R, Yang S, Wu Y, Yang L,
Li J (2016) Parthenogenetic haploid embryonic stem cells
efficiently support mouse generation by oocyte
injection. Cell Res, 26, 131-134


• Zhong G, Yin Q, Xie Z, Bai M, Dong R, Tang W, Xing Y-H,
Zhang H, Yang S, Chen L-L, et al (2015)
CRISPR-Cas9-Mediated Genetic Screening in Mice with
Haploid Embryonic Stem Cells Carrying a Guide RNA
Library. Cell Stem Cell, 17, 221-232


• Zhu K, Lei P-J, Ju L-G, Wang X, Huang K, Yang B,
Shao C, Zhu Y, Wei G, Fu X-D, et al (2017)
SPOP-containing complex regulates SETD2 stability and
H3K36me3-coupled alternative splicing. Nucleic Acids
Res, 45, 92-105


5.4 Bioinformatics Service Core Facility


Staff: 


Guangyong Zheng, Head 
Phone: +86-21-5492 0552 
Email: zhenggy@sibs.ac.cn 


Ruifang Cao, Technician 

Shanghao Rong, Technician 

Zhaoyuan Fang, Assistant Research Fellow 
Zhen Yang, Associate Research Fellow 


Overview 


PICB Bioinformatics Core was established in
February 2017, and research fellows were
transferred from other PICB research groups in
July 2017. The core will be a service platform
providing life science omics data analysis,
covering genome, transcriptome, phenome,
exposome, and multidimensional omics data.
These analyses will include standard analysis
pipelines and customized analysis solutions.


Current state 


The core is working on the NODE (National Omics
Data Encyclopedia) project, carrying out basic
work that includes data collection from PICB
staff and external research teams. The staff are
mainly working on microbiome big data, which
comprise metagenome data from NCBI SRA, EBI
Metagenomics, JGI and MG-RAST, and 16S rRNA
data from third-party databases. In addition,
some pipelines have been integrated, such as
PARA-Meta and Meta-Storm, which can extract
16S rRNA data from WGS data and calculate
biome similarity between user data and
reference biome datasets.


The core is working closely with the database
core. The two teams are developing some
databases together, such as a transferable
circular RNA database and a camel sequencing
database. The tasks of the bioinformatics team
are data cleaning, data annotation, and data
import into the databases. Both databases are
under development and will be finished in early
August.


Work Plan 


A standard omics data analysis system will
be built. The system will not be a simple
series of scripts; it will be a stable and
reliable pipeline in which the proportion of
manual intervention is reduced. The group will
work in two directions: one is pipeline and
solution design, the other is manual checking
and interpretation of analysis results.


All analysis tasks will be integrated into the
microbiome big database and the precision
medicine reference database. The bioinformatics
core will connect all cores: part of the data
will come from the sequencing core, some data
will come from the database core, and all data
will be managed and visualised by databases.
The bioinformatics core is the bridge between
raw data and public services.


Cooperation 


Dr Zhen Wang in Yixue Li Group: camel 
sequencing database 


Prof Kang Ning from Huazhong University 
of Science and Technology: Meta-Storm and 
PARA-Meta localization and optimization 


Prof Xiaoyang Zhi from Yunnan University: 
microbiome classification 


Dr Haokui Zhou from Shenzhen Institutes
of Advanced Technology: microbiome data
analysis


Life Science Data Center from Shanghai Center
for Bioinformation Technology (SCBIT): system
design and pipeline deployment.


5.5 Bio-Med Database Core Facility


Staff: 


Guoqing Zhang, Head
Phone: +86-21-5492 0465 
Email: gqzhang@picb.ac.cn 


Pengyu Wang, Technician 
Shantao Li, Technician 


Overview 


PICB Bio-Med Database group was
established in February 2017. The group will be
a service platform providing biological
and/or medical database design, development,
publication and maintenance. The group will
work with platform PIs to develop an omics-data
submission and sharing platform, a microbiome
big database series, and a precision medicine
reference database series. In addition, PICB/SIBS
research PIs, internal and external, may
develop databases with the group.


Current state 


The NODE (National Omics Data Encyclopedia)
project is the main task. It includes
3 major parts: the user data center, the data
sharing portal, and the platform management
center. In the user data center, all registered
users can submit data information in 6
perspectives (project, experiment, sample, run,
data, and analysis) and manage data in private,
protected, and public modes. Data owners can
respond to requests from other users about
protected data. In the data sharing portal, users
can visit the metadata and raw data of all public
data and the metadata of all protected data, and
can request protected raw data. In the platform
management center, the group can view all
statistical information and running information
of NODE.
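

The three data modes described above can be summarised as a small access-rule sketch. This is an illustration only, not NODE's actual implementation: the names `Mode`, `can_view_metadata` and `can_access_raw_data` are hypothetical, and the rule that private data is visible to its owner alone is an assumption, since the text does not define the private mode.

```python
from enum import Enum


class Mode(Enum):
    PRIVATE = "private"
    PROTECTED = "protected"
    PUBLIC = "public"


def can_view_metadata(mode: Mode, is_owner: bool) -> bool:
    """Metadata of public and protected data is visible to all users."""
    if is_owner:
        return True
    # Assumption: private data is hidden from everyone except its owner.
    return mode in (Mode.PUBLIC, Mode.PROTECTED)


def can_access_raw_data(mode: Mode, is_owner: bool, request_approved: bool = False) -> bool:
    """Raw data is open when public; protected raw data needs an approved request."""
    if is_owner or mode is Mode.PUBLIC:
        return True
    if mode is Mode.PROTECTED:
        # The data owner responds to requests about protected data.
        return request_approved
    return False
```

In this reading, the protected mode sits between the other two: anyone can discover the dataset through its metadata, but the raw data is released only after the owner approves a request.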


The microbiome big database series will build a
microbial data matrix, including metagenome,
16S rRNA, taxonomy and phenotype, chemical
and pathway information, functional
annotation, and microbiome structural
annotation. The precision medicine reference
database series will collect natural population
cohorts and major disease population cohorts
built under research projects; genome data with
host information and phenotype data will be
managed by unified databases. The two database
series are currently being designed and will be
fully online in 2-3 years.


In addition, this core works closely with the
bioinformatics core. The two teams are
developing some databases together, such
as a transferable circular RNA database and a
camel sequencing database. The tasks of the
database team are data storage, data
visualisation, and data searching and browsing.
Both databases are under development and will
be finished in early August.


Work Plan


NODE is online (http://biosino.org/node/). Much
work remains: data volume should reach 5 PB in
2-3 years, network transfer speed should reach
50-100 Mb in Shanghai, a major service and
backup site, whose network speed can reach
1 Gb, should be set up in Gui'an New Area,
Guizhou within 1 year, and so on.


The microbiome big database series and
precision medicine reference database series
should be developed in 2-3 years, and some
topic databases will be released during this
period.


Cooperation with other teams will be
carried out, and some databases will be
developed, which may be related to disease
multi-omics knowledge databases or platforms.


Cooperation


Dr Zhen Wang in Yixue Li Group: camel
sequencing database


Dr Hong Li in Yixue Li Group: lung cancer
database with Shanghai Chest Hospital


Prof Kang Ning from Huazhong University
of Science and Technology: Meta-Storm and
PARA-Meta localization and optimization


Prof Xiaoyang Zhi from Yunnan University:
microbiome classification


Dr Haokui Zhou from Shenzhen Institutes
of Advanced Technology: microbiome data
analysis


Life Science Data Center from Shanghai Center
for Bioinformation Technology (SCBIT): system
design and database development.


6. Teaching Activities & Guest Scientists


6.1 Teaching Activities & the International Max-Planck 
Research School (IMPRS) 


Introduction to the Education Program


The main educational program at PICB is
the MS/Ph.D. joint program. Students with
backgrounds in Biology, Mathematics,
Computer Science and Physics from the
top universities in China are recruited. The
program lasts for 5-7 years, during which
the students initially spend two years on
the Master program, and then advance to
the Ph.D. program after passing the Ph.D.
qualification examination. In addition to
internal graduate students, the institute has
also hosted a number of external doctorate
and master students, as well as interns at all
levels.


Graduate Students and Postdocs


The institute follows the MS/Ph.D. graduate
program of CAS, and currently has 105
graduate students, 14 jointly educated students
with ShanghaiTech University, and 2 foreign
students who are pursuing Ph.D. degrees here.
Of these, 29 students are currently enrolled
in the Master program and 76 students in the
Ph.D. program. Over the past three years, 48
students successfully finished their studies
and obtained Ph.D. degrees and 2 students
received Master degrees; 22 postdocs, from the
U.S., U.K., India and Pakistan, have been
recruited. In addition, PICB provides many
internship opportunities to both Chinese and
foreign students. Around 30 interns have
studied at PICB over the past three years.


Courses


The institute contributes regular taught
courses to the curriculum of the SIBS graduate
program. Four official courses, “Bioinformatics
Algorithms”, “Systems Theory”, “Operational
Research and Optimization” and “Human
Evolutionary Genetics”, are open to all SIBS
students. In order to promote PICB students’
professional knowledge and research skills,
from September to November in 2015 and
2016, PICB provided a three-month training
course on Basic Computational Biology, and
invited two senior students from PICB to
deliver this course to 39 first-year graduate
students.


Training Course


Since 2016, PICB and the CAS Key Laboratory
for Computational Biology have run a series of
training courses on “Advanced Computational
Biology” supported by the institute. These
training courses are taught directly by
front-line researchers and scientists in this
institute, including many group leaders, with
a combination of theoretical lectures
and computer practice. The aim of this
series of courses is to provide professional
training focused on computational biology
for students and biomedical researchers who
have a strong interest in high-throughput
sequencing and multi-omics data analysis.
The course has achieved high attendance and
a strong response since it was first launched
in 2016, reflecting the prominence and
influence of computational biology in current
life sciences. The success of this training has
also significantly improved the visibility of our
institute in the area of computational biology.


Oct 24-28, 2016, Frontiers of computational biology, Shanghai, China.


Scientific Lectures and Student 
Seminars 


Numerous research lectures and student
seminars are organized and hosted at the
institute in order to enrich and update
students’ scientific knowledge. Research
lectures from leading experts in their fields
contribute substantially to the education of
internal as well as external students. Student
seminars, which take place every month,
serve as an excellent platform for all students
to communicate their research.


International Max-Planck Research 
School 


In 2010, the International Max-Planck Research
School (IMPRS) was established by PICB
jointly with the Max-Planck Institute for
Molecular Genetics and the Free University
of Berlin. The goal of IMPRS is to enhance the
scientific atmosphere of PICB and encourage
international exchanges. IMPRS students are
offered advanced computational biology
courses as well as soft skills courses such
as scientific writing. IMPRS students are
given the opportunity for a research visit to a
German institute and can take part in the Otto
Warburg Summer School.


The selection procedure for IMPRS students
is aligned with the regular CAS procedures.
Second-year MS/Ph.D. joint program students
at PICB are generally at the transition from
course work to research work. As part of the
CAS program they have to take a Ph.D.
qualifying exam. After passing this exam,
they can additionally apply for the IMPRS
program and are then interviewed by a
committee consisting of internal and external
referees. There are approximately 10 slots per
year, and by 2017 a total of 85 PICB students
had been accepted into the program (Appendix I).
PICB IMPRS Student List


2010 | 2011
Shijun Xiao | Yimin Tao
Yandong Yin | Beifei Zhou
Yuning Wei | Hang Zhou
Xianbin Yu | Jiajia Xu
Yu Wang | Qingfeng Song
Xiao Cui | Yichi Xu
Hang Xiao | Zhisong He
Jun Li | Sile Hu
Xinyi Yang | Ran Li


2012 | 2013
Zenghua Fan | Xiaoyuan Zhou
Ruiqing Fu | Xiaoyang Dou
Lingfeng Gou | Hongwen Xuan
Yi Huang | Lu Qiao
Mingju Lv | Xiao'ou Zhang
Chen Ming | Zhijun Han
Wei Qian | Lei Tian
Yuting Wang | Lian Deng
Pei Wu | Shuyue Wang
Yi Xiao | Xin Huang
Ying Zhou | Yuchen Wang
Libo Huang


2014 | 2015
Qianhui Yu | Yuting Chen
Tiangen Chang | Yang Gao
Kai Yuan | Xixian Ma
Jinxi Li | Denghui Liu
Qidi Feng | Yizhen Yan
Chao Zhang | Zheng Luo
Qian Li | Xian Xia
Yaqiang Cao | Lei Zhang
Rui Dong | Xiaoran Zhang
Qin Yang | Xiaoji Wang


2016 | 2017
Honglong Zhao | Qingyu Xiao
Wangjie Hu | Yanan Guo
Xukai Ma | Shiting Chen
Guoyu Chen | Wenshan Gao
Fudi Wang | Benpeng Miao
Donghong Cai | Wenyan Chen
Quanlong Jiang | Jiaojiao Liu
Yin Huang | Chang Liu
Bingjie Li | Ziqian Hao
Sijie Wu | Liguang Yang
Dan Li
Pengyuan Du
Weifen Sun
Yang Gao


Sep 14-21, 2014, 2014 Otto Warburg International Summer School and Research
Symposium on Comparative and Evolutionary Genomics, Shanghai, China.


Otto Warburg Summer School 


As part of the IMPRS program, the Otto
Warburg International Summer School and
Workshop is held alternately in Berlin and in
Shanghai. The first of these jointly organized
summer schools took place in Berlin in
2011 and was dedicated to “Evolutionary
Genomics”. In 2012 it was held in Shanghai
and focused on “Genes, Metabolism, and
Systems Modeling”. The topic in 2013 in Berlin
was “Next Generation Sequencing and its
impact in genetics”. PICB hosted the summer
school on Comparative and Evolutionary
Genomics in Shanghai in 2014. The summer
school invites high-level researchers from
the current fields of computational biology
to cover basic concepts and research topics.
IMPRS students in Shanghai and Berlin as well
as selected students from abroad jointly
participate in this summer school.


Research visits 


IMPRS students are encouraged to conduct
research in Germany for up to 3 months. In
addition to IMPRS funding, PICB has helped
students to obtain funds from other sources
such as the China-Europe Joint Education
Program and has established connections
between CAS and overseas institutions for
students to do research abroad. At the same
time, PICB encourages individual research
groups to provide exchange opportunities
for students according to their research plans.
From 2014 to 2017, twenty-one students
attended the IMPRS summer school or did
research for three months. Among them,
one student went to the U.S. to conduct
research, one student went to the U.K. and the
other students went to Germany to complete
their exchange program.


Honors and Awards 


PICB students have received a variety of
awards for outstanding scientific
research achievements (Appendix II).


Appendix II


Honors and Awards received by PICB Students


Year | Award Title | Winner | Supervisor
2010 | Zhu Li Yue Hua Excellent PhD Student | Chunxuan Shao | Jun Yan
2010 | Pfizer First level Prize | Yuting Liu | Jun Yan
2010 | Pfizer Second level Prize | Zhongshan Li | Philipp Khaitovich
2010 | Di Ao Second level Prize | Wenfei Jin | Li Jin
2011 | Pfizer Outstanding Prize | Wenjin Li | Frauke Graeter
2011 | Pfizer First level Prize | Wenfei Jin | Li Jin
2011 | Pfizer Second level Prize | Kao Lin | Haipeng Li
2011 | Di Ao Second level Prize | Hang Xiao | Axel Mosig
2012 | Lily First Level Prize | Wenfei Jin | Li Jin
2012 | Di Ao Second level Prize | Haiyi Lou | Li Jin
2012 | Pfizer First level Prize | Jun Li | Stefan Gruenewald
2012 | Pfizer Second level Prize | Junrui Li | Haipeng Li
2012 | Pfizer Second level Prize | Xinyi Yang | Christine Nardini
2013 | CAS 100 Excellent PhD Thesis | Wenfei Jin | Li Jin
2013 | Di Ao Second level Prize | Qingfeng Song | Xinguang Zhu
2014 | Paul Biotechnology Prize | Yu Wang | Xinguang Zhu
2014 | Di Ao Second level Prize | Liu He | Philipp Khaitovich
2015 | CAS President Outstanding Award | Xiao-ou Zhang | Li Yang
2015 | CAS President Award | Yi Huang | Jingdong Jackie Han
2015 | Zhu Li Yue Hua Excellent PhD Student | Weiyang Chen | Jingdong Jackie Han
2015 | Ray Wu Prize | Xiao-ou Zhang | Li Yang
2015 | BHPB Prize | Xiao-ou Zhang | Li Yang
2015 | Di Ao Second level Prize | Yuning Wei | Philipp Khaitovich
2016 | Zhu Li Yue Hua Excellent PhD Student | Rui Dong | Li Yang
2016 | Di Ao Second level Prize | Qian Li | Philipp Khaitovich
Alumni Career


By March 2017, 76 students had graduated
successfully from PICB. One third of the
graduates were accepted as postdocs by top
overseas universities and research institutes.
Another third joined first-class domestic
scientific institutions to continue their
research, and the others have worked with
big companies.


The institute always tries to invite prominent
alumni to come back for face-to-face talks and
to share their experiences and challenges with
students. In 2017, Dr. Wenfei Jin was invited
to come back to visit PICB. He received his
Ph.D. at PICB with Prof. Li Jin and Prof. Shuhua
Xu in 2012, was selected for the “National
young talents program of China”, and joined
Southern University of Science and Technology
(SUSTech) in 2017. The exchange attracted a
large number of students and gave them not
only many valuable experiences, but also very
helpful guidance.
6.2 Guest Scientists
Visiting Scientists


Guest Scientists
Duration of stay longer than one month


Name | Institution | Country | Duration
Ivanov Peter | Lomonosov Moscow State University | Bulgaria | [illegible]-May
Flavel Matt | La Trobe University | Australia | Jan-Mar 2014
Banes Graham | MPI for Evolutionary Anthropology | UK | Jan-Mar 2014
Shmelkova Anna | [illegible] | Russia | Jan-Jul 2014
Long Stephen | University of Illinois at Urbana-Champaign | USA | Feb-Apr 2014
[illegible] | [illegible] Institute for Environmental [...] | Germany | [illegible]
Hahn Oliver | Ruprecht-Karls-University of Heidelberg | Switzerland | Jun-Nov 2014
Liu Jian | Fujian Agric. Univ. | China | Jan-Jun 2014
Verslype Antoine | University of Louvain | Netherlands | Sep-Nov 2014
Long Stephen | University of Illinois at Urbana-Champaign | USA | Oct-Nov 2014
Giavalisco Normen | [illegible] molecular plant [...] | Germany | Sep-Nov 2014
Hellmann Ines | Ludwig Maximilian University of Munich | Germany | Sep-Oct 2014
Bozek Katarzyna | Max-Planck-Institute for Evolutionary [...] | Poland | Sep-Dec 2014
Wang Jiawei | SIPPE, SIBS, CAS | China | Nov 2013-Nov 2015
Larraz Patricia | Universidad de Barcelona | Spain | Apr 2015-Feb 2016
Metzger Karoline | Ruhr-Universitaet Bochum | Germany | Jan-Mar 2015
Shmelkova Anna | Skolkovo Institute of Science and Technology | Russia | Feb-Oct 2016
Larraz Patricia | Universidad Complutense Madrid | Spain | Mar-Aug 2015
Bozek Katarzyna | Max-Planck-Institute for Evolutionary [...] | Poland | Apr-Sep 2015
Ventura Vittorio | Syracuse University | USA | May-Oct 2015
Ruiz-Linares Andres | University College London | UK | May-Jan 2015
Larraz Patricia | Universidad de Barcelona | Spain | Sep-Dec 2015
Zhang YinQi | UT-Dallas | China | Sep-Nov 2015
Ventura Vittorio | Syracuse University | USA | Nov-Dec 2015
Song Xiaowei | The Second Military Medical University | China | Dec 2016-Feb 2017
Mao Miaowei | [illegible] | China | Dec 2016-Feb 2017
Guest Scientists 

Duration of stay up to one month 

Bennett David A Rush University Medical Center USA 2014 

Long Stephen University of Illinoisat Urbana-Champaign USA 2014 

An Gynheung | Puhang University Korea 2014 

Jeon Jong-Seong | Kyung Hee University Korea 2014 

Long Stephen University of Illinoisat Urbana-Champaign USA 2014 

Chow Wah Soon | Australian National University China 2014 
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Irina Garanina Lomonosov Moscow State University Russia 2014 
Petr Kolosov a Use eh ela Russia 2014 
Maria Lemak ee Or Actii ang Russia 2014 
Seneking Mens a a for Evolutionary E 2014 
Huang Hailiang Broad Institute of MIT and Harvard China 2014 
Lässig Michael University of Cologne Germany 2014 
Liu Xiaole Shirley| Dana-Farber Cancer Institute USA 2014 
Bennett David A Rush University Medical Center USA 2014 
Huls Anke Leibniz Research Institute for Environmental Getimiahy 2014 
edicine 
Shaa eres pee Institute for Environmental Gameny 2014 
Zhong Sheng Department of Bioengineering, UCSD USA 2015 
Zhou Yu University of California, San Diego USA 2015 
Irie Naoki University of Tokyo Japan 2015 
Mikhail Ugryumov | Lomonosov Moscow State University Belarus 2015 
Konstantinov | Andrey Skolkovo institute of technology Russia 2015 
Dobryukha | Anna Skolkovo institute of technology Russia 2015 
Asadova Nargiz Skolkovo institute of technology Russia 2015 
Boyko Liliya Skolkovo institute of technology Russia 2015 
Baklanov Mikhail Skolkovo institute of technology Russia 2015 
Efimova Olga Skolkovo Institute of Science and echnology | Russia 2015 
Rivera France University of Cambridge USA 2015 
Patrush Maksim Immanuel Kant Baltic Federal University Russia 2015 
Zheng Siyuan I ndi of Texas MD Anderson Cancer USA 2015 
Zhu Xiaofeng Case Western Reserve University China 2015 
Li Jun University of Michigan USA 2015 
Efimova Olga Skolkovo Institute of Science and Technology | Russia 2016 
kharchenka: Peter Se gad nformatics in Harvard Medical USA 2016 
Treutlein Barbara peice iad for Eveluieneiy Germany 2016 
Turck Chris ax Planck Institute of Psychiatry Germany 2016 
Ivanova Violetta Russian Academy of Sciences Russia 2016 
Aseev Nikolai Russian Academy of Sciences Russia 2016 
Efimova Olga Skolkovo Institute of Science and Technology | Russia 2016 
Brenner Steven University of California Berkeley USA 2016 
Wenk Markus National University of Singapore Switzerland 2016
Lin Kaiqi Michelle National University of Singapore Singapore 2016
anne Amaury National University of Singapore France 2016
Chiang Charleston University of California USA 2015
Jin Wenfei National Institutes of Health China 2016
Fu Qiaomei Institute of Vertebrate Paleontology and Paleoanthropology, Chinese Academy of Sciences China 2016
Liu Xiaoming University of Texas USA 2016
Saitou Naruya National Institute of Genetics Japan 2016
Ye Kai Xi'an Jiaotong University China 2016 
Wang Jiayin University of Connecticut China 2016 
Wu Yufeng University of Connecticut China 2016 
Schikowski Tamara Leibniz Research Institute for Environmental Medicine Germany 2016
Naruya Saitou National Institute of Genetics, NIG Japan 2016
Fan Jianglin Yamanashi University, Japan Japan 2016 
Parry Martin University of Lancaster Germany 2016 
Lucie [Madan | Deparonem of Molecular Structural Biology [Se [2017 
7 Collaboration


7.1 Joint Appointments 


Jin Li Full Professor at Fudan University, Shanghai, China Oct 2005 
Gerwert Klaus Chair of Biophysics, Faculty of Biology, Ruhr-University Bochum 1993 
Wang Zefeng Guest Professor at ShanghaiTech University 2015
Han Jingdong Jackie Guest Professor at ShanghaiTech University 2013
Khaitovich Philipp Guest Professor at ShanghaiTech University 2013
Xu Shuhua Guest Professor at ShanghaiTech University 2013


7.2 Cooperation 


Partner Institutes 


Institute | Title/Topics | Since
Max Planck Institute for Molecular Systems Biology | Partner Institute | 2010
Max Planck Institute for Evolutionary Anthropology | Partner Institute | 2006
Max Planck Institute of Molecular Plant Physiology | Partner Institute | 2008
Ruhr-University Bochum | Partner Institute | 2008
Cooperation in China 
Institute | Title/Topics | Since
Dalian Medical University | RNA Splicing | 2016
Xinjiang Medical University | Population Admixture in Xinjiang | 2012
Xinjiang University | Population Admixture in Xinjiang | 2012
Jnana University | Many collaborative projects on human population genetics | 2012
Ningxia Medical University | Population Admixture of Hui People | 2012
Fudan University | Many collaborative projects on human population genetics | 201
Institute of Biodiversity Science and Geobiology, Tibet University, China | High Altitude Adaptation of Tibetan populations | 201
Chinese National Human Genome Center, Shanghai, China | Evolution of hepatitis B virus (HBV) | 201
Department of Mathematics, The Chinese University of Hong Kong | Mathematical modeling of human admixture dynamics | 201
WuHan University | "973" project - The molecular regulatory networks of malignant tumor development, invasion and metastasis | 2011
Institute of Biochemistry and Cell Biology, SIBS, CAS | "CAS leading project" - Regulatory networks of stem cell maintenance and differentiation | 201
Key Laboratory of Systems Biology, SIBS, CAS | "863" project - Major disease-related systems for biomedical research and its computational biology technology platform building | 2012
Beijing Center for Diseases Prevention and Control


Institute of Zoology, CAS 


PICB, iHuman, ShanghaiTech University,SIMM 


Lab of Sensory Integration and Behavior, 
Institute for Neuroscience, Institutes for Biological 
Sciences 


Institute for Biomedical Sciences, Fudan 
University 


State Key Laboratory of Hybrid Rice 
Shandong Normal University 
Hunan Agriculture University 


Institute of Biochemistry and Cell Biology, SIBS,
CAS


Institute of Neuroscience, SIBS, CAS
Kunming Institute of Zoology
Institute of Biochemistry and Cell Biology


Institute of Biochemistry and Cell Biology


Institute of Biochemistry and Cell Biology


Kunming Institute of Zoology 


Institute of Biochemistry and Cell Biology, SIBS, 
CAS 


Institute of Biochemistry and Cell Biology, SIBS, 
CAS 


Zhongshan hospital, Fudan University 


Zhongshan hospital, Fudan University 


Shanghai Jiao Tong University 


Fudan University 


Huazhong university of science and technology 


University of Science and Technology of China 


The Academy of Military Medical Sciences 


Application of nematode C.elegans in high 
throughput toxicity test 


"973" project - Stem cell and aging key scientific research program


GPCR structure and function 


Cell Tracking for Life Cell Imaging Data 


Proteomics research


Evolution of transposable elements in primates


Adjunct Professor
Honorary Professor
Adjunct Professor


Neuroscience research
Institute Title/Topics


2012 


2015 


2014 


2014 


2014 
2013
2015 


2006 


2006 
2010 
2012 
2013 
2013 


1998 


2012 


2015 


2014 


2015 


2011 


2011 


2011 


2011 


2011 


Institute | Title/Topics | Since
Shanghai Institute of Materia Medica | "CAS leading project" - Personalized stratification of antitumor drugs | 2015
Tianjin Institute of Industrial Biotechnology | crue en a cone | 2012
Institute of Biochemistry and Cell Biology, CAS | Epigenetic control of cell trans-differentiation | 2014
Peking University | Epigenetic regulation in stem cells | 2014
Institute of Biophysics | Epigenetic regulation in stem cell aging | 2015
Institut Pasteur of Shanghai, CAS | Higher-order chromatin structure of Malaria | 2015
Institute of Biochemistry and Cell Biology, CAS | a regulation ey Foyecine | 2013
Wuhan University | Epigenetic dysregulation in cancer | 2013
Ruijin Hospital | Epigenetic dysregulation in cancer | 2014
Fudan University | Genetic analysis of facial morphology | 2014
Fudan University Taizhou Health and Sciences Institutes | Study on the characteristics of physical anthropology in essential hypertension | 2014
The Institute of Health Sciences, SIBS, CAS | Functional validation of GWAS signals | 2015
Shanghai Children's Medical Center | Fingerprint patterns in Leukemia | 2016
Huashan Hospital | The genetic study of craniosynostosis | 2016
Beijing Institute of Genomics, CAS | The evolution and genetic mechanisms of pigmentation related phenotypes in East Asian and European populations | 2017
Kunming University of Science and Technology | The evolution and genetic mechanisms of pigmentation related phenotypes in East Asian and European populations | 2017
Shanghai Jiaotong University | Unrooted triples in phylogenetic trees | 2014
University of Science and Technology of China | Almost compatible split and set systems | 2014
Institute of Plant Physiology and Ecology | Evolution of human pathogens | 2011
Southern Medical University, Guangzhou | Evolution of human pathogens | 2011
Obstetrics & Gynecology Hospital, Shanghai | Genetic basis for birth defects | 2015
Beijing University of Posts and Telecommunication | Advanced Machine Learning Methods for Integrative Epigenomics | 2013
Zhongshan Hospital, Shanghai | Network Biomarkers for Acute Aortic Dissection | 2013
Key State Lab of Systems Biology, Shanghai | eG Deer nase Tension Merles Cancer | 2013
Tianjin University, Beijing | Using DNA methylation to predict cancer-type of unknown primary | 2015
Key Lab of Intelligent Information Processing, Institute for Computing Technology, Beijing; Peking Union Medical College Hospital, Chinese Academy of Medical Sciences, Beijing | Database of differential expressed miRNAs in human cancers | 2015
Tongji University | Brain fMRI related gene analysis | 2016
Shanghai Institute of Biochemistry and Cell | Hieno peine anels | 2016
Cooperation in Germany
Institute | Title/Topics
Max Planck Institute for Biophysical Chemistry | CAS President's International Fellowship Initiative
Max-Planck Institute of Evolutionary Anthropology | Human Migration History in Oceania
Faculty of Technology, Universität Bielefeld, Bielefeld, Germany | Development of method for genome structure analysis
Max Planck Institute for Evolutionary Biology | Recent Darwinian adaptation in mice
University of Cologne | Theoretical population genetics
Institute of Epidemiology II, Helmholtz Zentrum München | Development of a metabolite-based diagnostic algorithm to identify pre-type 2 diabetes in Asian and Caucasian populations
Max Planck Institute for Evolutionary Anthropology | Human evolution
Max Planck Institute of Molecular Plant Physiology | Metabolome / Lipidome studies
Tuebingen University | Data analysis algorithms
Department of Biophysics, Ruhr University Bochum (Eckhard Hofmann) | High-performance Computing cluster
Interdisciplinary Center for Bioinformatics, University of Leipzig (Peter F. Stadler) | Evolutionary Patterns of non-coding RNA
Institute for Computer Science, University of Bonn (Michael Clausen) | Signal Processing Methods for Analyzing In Vivo Flow Cytometer Data
Max Planck Institute of Molecular Plant Physiology | Natural variations of photosynthesis
Heinrich Heine University Düsseldorf | Elements controlling C4 photosynthesis
Max Planck Institute of Molecular Plant Physiology | Natural variations of photosynthesis

Special Cooperation Programs and Multi-Lateral Cooperation 


Institute | Title/Topics | Since
Genome Institute of Singapore, Singapore | Pan-Asian SNP Project-Phase II | 2013
The Catholic University of Korea | Genetic Diversity of East Asians and Its Evolutionary/Phenotypic Implications | 2012
EPFL | Time and temperature nanosensor | 2011
EPFL | i-needle development | 2013
AARUS | RNA-seq | 2011
St George University | Petri Network Simulations | 2011
CNR | Rheumatoid Arthritis Modeling | 2011
International Rice Research Institute | C4 rice project | 2009
University of Illinois at Urbana Champaign | RIPE project | 2012
Sheffield University | 3to4 project | 2012
University of Dublin | Grassmargin project | 2011
Hebrew University | NSFC-ISF Research | 2017
University College London, The Queensland Institute of Medical Research, Erasmus University Medical Center, King's College London | Overview of International Visible Traits Genetics (Visigen) consortium research | 2016
Further International Collaboration


Institute | Title/Topics | Since
National University of Singapore | Asian Genetic Diversity | 2013
Institute of Biological Sciences, Faculty of Science Building, University of Malaya, Malaysia | Genome-wide association study of type-2 diabetes in Malay populations | 2011
Institute of Medical Molecular Biotechnology (IMMB), Faculty of Medicine, Universiti Teknologi MARA, Sungai Buloh Campus, Selangor, Malaysia | Population genomics of Negritos in Malaysia | 2011
Jeffrey Cheah School of Medicine and Health Sciences, Monash University | Population genomics of Negritos in Malaysia | 2011
Institute of Genomics & Integrative Biology (IGIB), New Delhi, India | Copy number variations and population structure in Indian populations | 2011
Integrated Research Center for Genome Polymorphism (IRCGP), The Catholic University of Korea, Seoul, Korea | Genetic diversity of East Asian populations | 2011
Machine Learning Department & Language Technology Institute & Computer Science Department, School of Computer Science, Carnegie Mellon University, USA | Development of methods for detecting population structure | 2011
Electrical Engineering and Computer Science Department, Center for Proteomics & Bioinformatics Comprehensive Cancer Center, Case Western Reserve University, USA | Identity by descent analysis of human population data | 2011
School of Medical Sciences, Universiti Sains Malaysia, Malaysia | Population structure and history in Malaysia | 2010
Alzheimer Disease Center of Rush University | Network discovery and comparison of histone modifications regulating gene expression in the rhesus monkey and human brain aging process | 2011
London Research Institute | Alternative spliced events during dietary/exercise intervention | 2012
University College London | Risk Prediction and Early Detection of Women specific Cancers (FORECEE HORIZON2020) | 2015
University College London | Blind Source Separation tools for cell-type deconvolution in EWAS | 2015
King's College London | Network Entropy in Systems Biology | 2013
University of Cambridge / CRUK Cambridge Institute | Identification of cancer stem cells in ovarian cancer | 2015
University of Cambridge / CRUK Cambridge Institute | Signaling entropy as a prognostic and predictor biomarker in cancer | 2015
Johns Hopkins University, Baltimore, USA | Sources of DNA methylation variability | 2016
Max-Planck Institute for Informatics, Saarbruecken, Germany | Systems Epigenomics methods to identify disrupted regulatory networks in cancer | 2016
UCLA, Los Angeles, USA | DNA methylation age and menopause | 2016




Institute | Title/Topics
Bristol University, UK | Identifying causal DNA methylation alterations in smoking associated lung cancer
MRC/NSHD, UCL, London, UK | Cancer Risk Prediction
International Human Epigenome Consortium + Blueprint | DNA methylation and type-1 diabetes / Epigenomic tools for cell-type deconvolution analyses
Helmholtz Institute for Biomedical Engineering, Aachen, Germany | Study of lncRNA HOTAIR in human cancer
University College London | The research of genetics of facial features and dental morphology and human skin diversity
King's College London | Gene-environment interaction on skin aging
The University of Edinburgh | Functional validation for fingerprint pattern gene
University of Wisconsin | QM/MM simulations on proteins
Institute for Theoretical Chemistry and Structural Biology, University of Vienna | Evolutionary Patterns of non-coding RNA
Department of Computer Science, University of Missouri-Columbia, USA | Cell Tracking for Life Cell Imaging Data
University of Canterbury, Christchurch, New Zealand | Phylogenetics
University of East Anglia, Norwich, UK | Reconstructing phylogenetic trees from partial distances
University of Georgia, US | Missing data in statistical analysis
University of East Anglia, UK | Mathematics for phylogenetics

8 Events Organized by the Institute


8.1 Symposia and Conferences 


Nov. 5-6, 2015, Scientific Symposium on "Retrospect and Prospect of Computational Biology Research", Shanghai, China.


This symposium, titled "Retrospect and Prospect of Computational Biology Research",
celebrated the 10th anniversary of the PICB with a scientific gathering. Experts gave
20 talks on diverse topics, covering novel developments and applications of
computational biology approaches in molecular biology, neuroscience, evolutionary
biology, microbiology, plant sciences and other related areas. The symposium reflected
the cutting-edge and interdisciplinary character of computational biology.


July 20-22, 2016, Interdisciplinary Forum: New Biology - From Big Data to Novel Mechanism, Shanghai, China.
This interdisciplinary forum on biomedical big data focused on three aspects: the
advances of big data in biomedicine; the methods, mechanisms and standards for sharing
and applying biomedical big data; and the perspectives of biomedical big data and the
cultivation of interdisciplinary talent. Experts from China, Germany and other
countries extensively discussed new advances and forthcoming directions of biomedical
big data, and brainstormed possible ways to cultivate a younger generation of
interdisciplinary talent. The innovative application of biomedical big data has become
a critical area of international competition. The experts generally agreed that the
construction of a centralized big data infrastructure is crucial to the success of
biomedical data management and application, which in turn will serve the urgent needs
of the modern biomedical system in China.


July 25-28, 2016, Sino-German RNA Symposium 2016: RNA Biology and Human Disease: From
Molecular Mechanisms to Global Networks, Rauischholzhausen, Germany.


This symposium was supported by the Sino-German Center for Research Promotion and was
organized by Prof. Dr. Albrecht Bindereif (Justus-Liebig-University of Giessen, Germany)
and Prof. Dr. Zefeng Wang (Max-Planck-Partner Institute for Computational Biology,
Shanghai, China). The symposium cultivated and extended Sino-German personal
friendships among RNA-biology scientists, thereby linking the Chinese and German RNA
biology communities, with a special focus on non-coding RNAs and human disease. By
bringing together experts from two closely related disciplines, RNA biology and RNA
bioinformatics, the symposium provided a forum to discuss the current state of the art,
open questions, and newly developing perspectives in this emerging field. It also
initiated and promoted Chinese-German collaborations across these disciplines, both on
the bilateral level and by creating a larger grant network initiative.


Mar 23-24, 2017, Computational Biology for Big Data: New Opportunities and Challenges, Shanghai, China.


This symposium provided a forum to discuss the new opportunities and technical
challenges of biomedical big data management, with special attention to the analysis of
single cell sequencing data. The symposium had three sessions, focusing on new
techniques for big data analysis (such as deep learning), big-data sharing and database
management, and new challenges and opportunities arising from single cell sequencing
and the integration of multi-omic data.


Aug 14-17, 2017, Otto Warburg International Summer School and Research Symposium on RNA
Biology: Diverse Functions Revealed by Global Analysis, Shanghai, China.


The 2017 Otto Warburg International Summer School and Research Symposium on RNA
Biology: Diverse Functions Revealed by Global Analysis will be held from August 14th to
August 18th in Shanghai. The aim of this program is to bring together researchers and
students from different backgrounds (including genetics, molecular biology, and
bioinformatics) to discuss current research highlights. This year's program is
organized by Prof. Dr. Zefeng Wang, Prof. Dr. Li Yang and Prof. Dr. Zhen Shao.
8.2 Seminars


Mar 2014 
Seminar Talk Speakers: 


+ Yijun Ruan, Jackson Laboratory for Genomic Medicine, Farmington, Conn., Mapping the 3D
genome.


+ Ansgar Poetsch, Ruhr University, Bochum, "Dynomics" - analytical approaches for dynamic
biological processes on the -omics level.


+ Christoph Dieterich, Max Planck Institute for Biology of Ageing, Systemic approaches to study 
post-transcriptional networks. 


Institute Seminar Speakers: 


+ Zhen Shao, Group of Comparative Biology, PICB, Shanghai, China, MAnorm model for
quantitative comparison of ChIP-Seq data sets and its downstream applications. 


Apr 2014 
Seminar Talk Speakers: 


+ Alexandros Pertsinidis, Memorial Sloan-Kettering Cancer Center, New York, USA, Visualizing 
complex biological assemblies and interactions with molecular precision using fluorescence 
imaging. 


+ Gene W. Yeo, Dept. of Cellular and Molecular Medicine, School of Medicine, UCSD, Insights into
RNA processing by genome-wide, large-scale experiments. 


+ Clare Garvey, Editor from BioMed Central, Biomedical thesis writing and publishing. 


Institute Seminar Speakers:


+ Klaus Gerwert, Department of Biophysics, Ruhr University Bochum, Bochum, Germany; PICB,
Shanghai, China, Marker-free imaging of colon cancer at different scales: Proteins, membranes,
cells and tissue.


May 2014 


May 22, 2014 
Xu Guangqi Lecture Series: Dr. Robert Huber


Nobel Prize in Chemistry 1988, Max-Planck-Institute for Biochemistry,
Beauty and Fitness for Purpose: The Architecture of Proteins, the Building Blocks of Life.


Seminar Talk Speakers: 


+ Gang Fang, Yale University, Genes persistent in extant species through 3.5 billion years of
organism speciation.


+ Zhanyu Ma, Beijing University, Non-Gaussian Statistical Models and Their Applications in DNA 
Methylation Analysis. 
+ S. Cenk Sahinalp, Indiana University, Bloomington, IN, USA, and Canada Research Chair, Simon
Fraser University, Burnaby, BC, Canada, Computational methods for clonal evolution and tumor
subtyping.


+ Alisdair Fernie, Max-Planck-Institute for Molecular Plant Physiology, Metabolomics assisted 
breeding. 


+ Zoran Nikoloski, Max-Planck-Institute for Molecular Plant Physiology, Data-driven
constraint-based modeling of plant metabolism. 


Jul 2014 
Seminar Talk Speakers: 


+ David A Bennett, Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, 
Building a Pipeline to Identify and Validate Novel Therapeutic Targets for Neurodegenerative 
Disease. 


+ Lana Garmire, University of Hawaii Cancer Centre, Towards solving the challenges of big cancer
data integration. 


+ Min Chen, School of Biological Sciences, University of Sydney, Australia, Novel chlorophylls and 
new directions in photosynthesis research. 


Aug 2014 
Seminar Talk Speakers: 


+ Yunde Zhao, UCSD, Next generation CRISPR technology. 


+ Jiang Qian, Johns Hopkins University, Global analysis of gene regulation and signaling 
networks. 


Sep 2014 
Seminar Talk Speakers: 


+ Yiwen Chen, Department of Biostatistics & Computational Biology, Dana Farber Cancer 
Institute, Harvard School of Public Health, Integrative approaches for decoding the function 
of long non-coding RNAs in human Cancer. 


+ Andres Ruiz-Linares, Human Genetics in the Department of Genetics, Evolution and 
Environment at UCL, Admixture in Latin America: Geographic Structure, Phenotypic Diversity 
and Self-Perception of Ancestry Based on 7,342 Individuals - The CANDELA Study. 


Institute Seminar Speakers: 


+ Xiaoou Zhang, Group of Computational Transcriptomics and Bioinformatics, PICB, Shanghai,
China, Complementary Sequence Mediated Exon Circularization. 
+ Yungang He, Group of Computational Genetics, PICB, Shanghai, China, Detecting Recent
Positive Selection with High Accuracy and Reliability by Conditional Coalescent Tree. 


+ Xiaole Shirley Liu, Harvard Medical School, Dana-Farber Cancer Institute, Computational 
Modeling of Cancer Gene Regulation. 


Oct 2014 
Seminar Talk Speakers: 


+ Brian Hare, Group of Evolutionary Anthropology, Duke University, The evolution of human
nature: from dogs to bonobos to people. 


+ Nicholas S. Foulkes, Institute of Toxicology and Genetics, Karlsruhe Institute of Technology / 
Ruprecht-Karls-Universität Heidelberg, The circadian clock: New lessons from fish.


+ Gynheung An, Crop Biotech Institute, Kyung Hee University, Epigenetic regulations of flowering 
time and biomass in rice. 


+ Xiaowei Wang, Department of Radiation Oncology and Department of Biomedical 
Engineering, Washington University School of Medicine, Computational biology in microRNA 
research. 


+ Yong Yu, Huffington Center on Aging and Department of Molecular and Human Genetics, 
Baylor College of Medicine, Shedding new light on LIPIDOLOGY with chemical imaging 
methodologies and genetic screens. 


Student Seminar Speakers: 


+ Gangcai Xie, Group of Comparative Biology, PICB, Shanghai, China, High expression of
HERVHs defines the long sought human naive-like stem cells. 


+ Feng Gao, Group of Evolutionary Genomics, PICB, Shanghai, China, FastEPRR: a Fast Estimator 
for the Population Recombination Rate. 


Nov 2014 
Seminar Talk Speakers: 


+ Yun Liu, School of Basic Medical Sciences, Fudan University, DNA methylation in human 
disease. 


+ Michael Lässig, University of Cologne, Germany, Evolutionary dynamics far from equilibrium.


+ Wei Chen, Berlin Institute for Medical Systems Biology, Max-Delbrueck-Center for Molecular 
Medicine, Post-transcriptional gene regulation and deregulation in human diseases. 


+ Simone de Jong, KCL, London, Seasonal changes in gene expression represent cell-type
composition in whole blood. 
Institute Seminar Speakers:


+ Kun Tang, Group of Functional Human Genetic Variation, PICB, Shanghai, China, Temporal
resolution of human genetic adaptation after modern human emergence, a study combining
a coalescent model based selection test and ancient DNA data.


+ Boon Peng, Group of Population Genomics, PICB, Shanghai, China, Population diversity and
genetic architecture of multi-ethnic groups in Peninsular Malaysia.


Dec 2014 
Seminar Talk Speakers: 


+ Jun Wang, Rega Institute KU Leuven/VIB Center for the Biology of Disease, Microbial Ecology at 
the Frontier of Host Ecology, Evolution and Health. 


+ Colin Collins, University of British Columbia and Senior Scientist Vancouver Prostate Centre, 
Mechanistic Insights To Therapeutic Resistance In Prostate Cancer. 


Jan 2015 
Seminar Talk Speakers: 


+ Jun Li, Human Genetics and Research Associate Professor of Computational Biology and 
Bioinformatics of U-M, Classification problems in genomics. 


+ Yu Zhou, University of California, San Diego, Genome-wide analysis of RNA-protein
interactions in regulated transcription and splicing.


Institute Seminar Speakers: 


+ Yi Huang, group of Molecular Systems Biology, PICB, Shanghai, China, Single-cell level spatial 
gene expression in the embryonic neural differentiation niche. 


+ Gang Wei, group of Epigenome Biology, PICB, Shanghai, China, Epigenetic regulation in 
development and disease. 


Apr 2015 
Seminar Talk Speakers: 


+ Kaifu Chen, Baylor College of Medicine, Houston, Broad H3k4me3: an epigenetic approach to 
mutation-independent cancer gene discovery. 


Institute Seminar Speakers: 


+ Stefan Grünewald, group of Phylogenetic Combinatorics, PICB, Shanghai, China, Estimating 
rate inhomogeneity using Hadamard conjugation. 
+ Sijia Wang, Group of Dermatogenomics, PICB, Shanghai, China, Environmental and genetic
factors for skin aging. 


May 2015 
Seminar Talk Speakers: 


+ Chunbo Lou, CAS Key Laboratory of Microbial Physiological and Metabolic Engineering, 
Institute of Microbiology, Chinese Academy of Sciences, Beijing, Insulated design of promoters 
and operators for precisely manufacturing genetic circuits. 


+ Charleston Chiang, Department of Ecology and Evolutionary Biology, University of California, 
Population genetic insight to the study of human height. 


+ Christian Hermans, Université Libre de Bruxelles, lab. Plant Physiology and Molecular Genetics, 
Belgium, Nitrogen Influence on Lateral Root Development in Plants. 


+ Jian Xu, Children’s Research Institute, UT Southwestern Medical Center, Dallas, Role of 
Non-Coding Regulatory Genome in Stem Cells and Cancer. 


+ Shisong Ma, Department of Plant Biology and the Genome Center, University of California, 
Davis, From big data to biological knowledge - using gene network and protein microarray to 
study signaling systems in Arabidopsis, rice, and human. 


+ Ji-Long Liu, Programme Leader MRC Functional Genomics Unit Department of Physiology, 
Anatomy and Genetics University of Oxford, The cytoophidium and its kinds: Filamentation 
and compartmentation of metabolic enzymes. 


+ Michael V. Ugrumov, Ministry of Education and Science of the Russian Federation, President of
the Russian Academy of Sciences, Head of the Laboratory of Nervous and Neuroendocrine
Regulations, Institute of Developmental Biology RAS, The role of the brain in the integration of
the whole organism.


Jun 2015 
Seminar Talk Speakers: 


+ Terry Speed, Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia,
Instrumental variables and negative controls. 


Jul 2015 
Seminar Talk Speakers: 


+ Wenyi Wang, Tenure Track Assistant Professor at MD Anderson Cancer Center, USA, 
DeMix-Bayes: A Bayesian Model for the Deconvolution of Mixed Cancer Transcriptomes. 


+ Sheng Zhong, Department of Bioengineering, UCSD, Updates on RNA-RNA interactions. 


+ Xiaofeng Zhu, Case Western Reserve University, Genome-Wide Survey in African Americans 
Demonstrates Epistasis of Fitness in the Human Genome. 
Sep 2015
Institute Seminar Speakers: 


+ Zhen Shao, Group of Regulatory and Systems Genomics, PICB, Shanghai, China, Computational 
analyses of multiple omics’ datasets and sequence features. 


+ Yi Xiao, Group of Plant System Biology, PICB, Shanghai, China, A mechanistic analysis of
mesophyll conductance using a three-dimensional model of leaf photosynthesis and CO2 
diffusion. 


Oct 2015 
Seminar Talk Speakers: 


+ Dali Han, Department of Chemistry and Institute for Biophysical Dynamics, The University of
Chicago, Decoding new players in gene regulation game: novel modifications in DNA and RNA.


+ Chengqi Yi, Peking University, Sequencing Nucleotide Modifications of Epigenetic Significance. 


+ Anthony V. Furano, National Institute of Diabetes and Digestive and Kidney Diseases, National 
Institutes of Health, L1 Retrotransposons: Shapers and Historians of Mammalian Genomes. 


Institute Seminar Speakers: 


+ Zefeng Wang, Group of RNA System Biology, Systematic regulation and mis-regulation of
alternative splicing in cancer. 


Nov 2015 
Seminar Talk Speakers: 


+ Pedro Ballester, Cancer Research Center of Marseille, France, Unearthing new genomic markers
of drug response by improved measurement of discriminative power. 


Institute Seminar Speakers: 


+ Yungang He, Group of Computational Genetics, PICB, Shanghai, China, Identifying and 
Measuring Selection Difference among Populations. 


+ Zhen Yang, Group of Computational Systems Genomics, PICB, Shanghai, China, A
smoking-associated EWAS study of buccal cells and a pan-cancer wide study of epigenetic enzymes
related to DNA methylation deregulation in cancer.


Dec 2015 


Dec 16, 2015 

Xu Guangqi Lecture Series: Dr. Hartmut Michel
Director, Max Planck Institute of Biophysics, 

The nonsense of biofuels 
Seminar Talk Speakers:


+ Ben Raphael, Brown University, Computational Characterization of Mutational Heterogeneity 
in Cancer. 


+ Guang-Zhong Wang, Department of Neuroscience, UT Southwestern Medical Center, 
Decoding the Transcriptome of Behavior: from Yeast to Human Brain. 


+ Can Xie, Peking University, Homing Instinct: Signal from the Earth--The Molecular Mechanism 
of Magnetoreception and Animal Migration. 


Institute Seminar Speakers: 


+ Xuan Cao, Group of Epigenome Biology, PICB, Shanghai, China, Mapping chromatin architecture 
in Plasmodium falciparum reveals PfSETvs reshapes var genes chromatin structure. 


+ Rui Dong, Group of Computational Transcriptomics and Bioinformatics, PICB, Shanghai, China, 
Genome-wide gene expression profiling of haploid embryonic stem cells reveals molecular 
mechanism to the increased efficiency of genetically modified mice generation. 


Jan 2016 
Seminar Talk Speakers: 


+ Xiaoming Liu, Human Genetics Center, University of Texas School of Public Health, Exploring 
Demographic Histories Using SNP Frequency Spectrums. 


+ Qian Li, PICB, Shanghai, China, Lipidome composition change during postnatal development
in primate brains. 


+ Ying Zhou, PICB, Shanghai, China, Inference of admixture history with admixture introduced 
Linkage Disequilibrium 


Feb 2016 
Seminar Talk Speakers: 


+ Saitou Naruya, Division of Population Genetics, National Institute of Genetics Mishima, Japan, 
First people who reached Sundaland during ice age. 


+ Fan Liu, Beijing Institute of Genomics, CAS, The MC1R Gene and Youthful Looks. 


Mar 2016 
Institute Seminar Speakers: 


+ Guangyong Zheng, Group of Plant System Biology, PICB, Shanghai, China, Gene regulatory 
network. 


+ Zhen Shao, Group of Regulatory and Systems Genomics, PICB, Shanghai, China, PRC2-binding 
long non-coding RNAs can be accurately predicted by their sequence composition. 


Apr 2016 


+ Colin Collins, Director of the Vancouver Prostate Cancer Centre, Insights To Treatment Resistant 
Prostate Cancer. 


+ Steven E. Brenner, University of California, Berkeley, RNA REGULATION Splicing & nonsense-mediated 
mRNA decay Pharmacotranscriptomics. 


+ Jun Zhu, Zhejiang University, Association Mapping for Complex Diseases and Precision Molecular 
Medicine. 


+ Steven E. Brenner, University of California, Berkeley, Interpreting Newborn Genomes. 


+ Wu Wei, Stanford Genome Technology Center, Stanford University, Exposing layers of transcriptome 
complexity and dynamics. 


+ Lei Hou, Group of Molecular Systems Biology, PICB, Shanghai, China, A Systems Approach to 
Reverse Engineer Lifespan Extension by Dietary Restriction. 


+ Praveen Sethupathy, University of North Carolina Chapel Hill, Regulation of stem cell function 
in the intestine: a genomics tale of gut micro-biota and micro-RNAs. 


May 2016 
Seminar Talk Speakers: 


+ Wei-Hua Chen, Geneva University Hospital, Genève, Switzerland, Economics behind many 
intriguing observations in bacterial genomes. 


+ Kai Ye, Xi'an Jiaotong University, Bioinformatics pharmacology cancer, Novel algorithms for 
next-generation sequence data analysis and their applications in pan-cancer genome data 
and the genome of the Netherlands. 


+ Yufeng Wu, University of Connecticut, Algorithms and Applications for Probability Computation 
on Multispecies Coalescent. 


+ Jialiang Huang, Dana-Farber Cancer Institute, Harvard T.H. Chan School of Public Health; Boston 
Children’s Hospital, Harvard Medical School 


Jun 2016 
Student Talk Speakers: 


+ Jin-Xi Li, Group of Dermatogenomics, PICB, Shanghai, China, The genetic basis of fingerprint 
patterns and the implications for disease susceptibility. 


+ Chris W. Turck, Max Planck Institute of Psychiatry and Ludwig Maximilians University, Munich, 
Pathway Illumination for Disease Research- Psychiatric Disorders and Antidepressant 
Treatment Response. 
Jul 2016 
Seminar Talk Speakers: 


+ Yves A. Lussier, The University of Arizona Director, Center for Biomedical Informatics and 
Biostatistics, The personalome era: precision medicine using the dynamic transcriptome and 
intergenic SNPs synergy or antagony. 


+ Guo-Cheng Yuan, Department of Biostatistics and Computational Biology, Dana-Farber 
Cancer Institute and Harvard Chan School of Public Health, Mapping Cell States from 
Single-Cell Gene Expression Data. 


Institute Seminar Speakers: 


+ Sijia Wang, Group of Dermatogenomics, PICB, Shanghai, China, Close in on phenotypes — 
examples of physical appearance variations and their implication on precision medicine. 


Aug 2016 
Seminar Talk Speakers: 


+ Hui Zhang, Biostatistics Department, St. Jude Children’s Research Hospital, Major Statistical 
Challenges in RNA-Seq Count Analysis. 


Sep 2016 
Seminar Talk Speakers: 


+ Bing Zhou, the University of California in San Diego, Chromatin-interacting RNAs Reveal Gene 
Regulatory Architecture. 


+ Qiaomei Fu, Head of the Ancient DNA Lab at the Institute of Vertebrate Paleontology and 
Paleoanthropology, Chinese Academy of Sciences, Tracing modern human history using 
ancient DNA. 


+ Yu Xue, Department of Bioinformatics & Systems Biology, Huazhong University of Science and 
Technology, PTM Bioinformatics. 


+ Wenfei Jin, NHLBI systems biology center, National Institutes of Health, Single Cell Epigenomic 
Sequencing for Precision Medicine. 


Institute Seminar Speakers: 


+ Shuhua Xu, Group of Population Genomics, PICB, Shanghai, China, Ancestral Origins and 
Genetic History of Tibetan Highlanders. 


+ Qingfeng Song, Group of Plant System Biology, PICB, Shanghai, China,Modeling Canopy 
Photosynthesis to Guide Breeding for Higher Crop Yield and Resource Use Efficiency. 
Oct 2016 
Seminar Talk Speakers: 


+ Zhaohui Yang, KunMing University of Science and Technology, Genetic mechanism of skin 
color adaptive evolution in East Asians. 


+ Keji Zhao, Systems Biology Center, National Heart, Lung and Blood Institute, Epigenetic 
regulation of T cell differentiation. 


+ Wei Xie, School of Life Sciences, Tsinghua University, Epigenetic inheritance and reprogramming when 
life begins. 


Nov 2016 
Seminar Talk Speakers: 


+ Barbara Treutlein, Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary 
Anthropology, Reconstructing human neurogenesis using single cell transcriptomics. 


+ John M. Sedivy, Brown University, Epigenetic Changes and Somatic Retrotransposition in Cellular 
Senescence and Aging. 


+ Nikolaus Georg Boris Rajewsky, Systems Biology of Gene Regulatory Elements, Institute for 
Medical Systems Biology, MDC for Molecular Medicine in the Helmholtz Association, Knocking 
down circRNAs and reconstructing tissues from single cell sequencing data. 


+ Peter Kharchenko, Biomedical Informatics in Harvard Medical School, Analysis of transcriptional 
and genetic heterogeneity in cancer. 


+ Adrian R. Krainer, Cold Spring Harbor Laboratory, Mechanism-based Antisense Therapy for 
Spinal Muscular Atrophy. 


+ Manfred Kayser, Professor of Forensic Molecular Biology, Department of Genetic Identification, 
Erasmus MC University Medical Center Rotterdam, Genetics and DNA prediction of human 
appearance. 


+ Zoran Nikoloski, Systems Biology and Mathematical Modeling Group, Max Planck Institute of 
Molecular Plant Physiology, How plastic is metabolism? A comparative study across organisms. 


Institute Seminar Speakers: 


+ Li Yang, Group of Computational Transcriptomics and Bioinformatics, PICB, Shanghai, China. 


+ Xiaoran Zhang, Group of Epigenome Biology, PICB, Shanghai, China. 
Dec 2016 
Seminar Talk Speakers: 


+ Zhengqing Ouyang, The Jackson Laboratory for Genomic Medicine, Understanding the 
dynamic regulomes by 3D genomics and RNAomics. 


+ Fuli Yu, Research Center and Baylor College of Medicine Human Genome Sequencing Center, 
The development of foundational projects led to rare variant discoveries for common/rare 
diseases. 


+ Martin Parry, Plant Science for Food Security, Improving Photosynthesis The Engine of Life. 


Feb 2017 
Seminar Talk Speakers: 


+ Xianting Ding, Shanghai Jiao Tong University, School of Biomedical Engineering, Institute for 
Personalized Medicine. 


+ Siyuan Zheng, Department of Genomic Medicine, MD Anderson Cancer Center, From Data 
Driven to Question Driven: insights from high throughput profiling of the cancer genome. 


EVENTS ORGANIZED BY THE INSTITUTE 


8.3 PICB in News 


Science Daily | Europeans have three times more Neanderthal genes for lipid catabolism than Asians or Africans | 02-Apr-14
China Science Daily | Pursuing the scientific and technological cooperation with highest quality | 20-May-14
Science News | Did Big Brains Sap Our Strength? | 27-May-14
The Economist | Muscled out: Human beings are brainy weaklings | 31-May-14
China Science Daily | Scientists develop novel method to localize variants of positive selected genes | 09-Sep-14
People's Daily Online | China's scientists make new progress on circular RNA research | 19-Sep-14
CNR News | Scientists publish new progress on circular RNA research | 19-Sep-14
Global Times | RNA breakthrough | 19-Sep-14
ScienceNet | New progress on circular RNA research | 20-Sep-14
Radio Shanghai | Scientists make new progress on circular RNA research | 20-Sep-
Wenhui Daily | Circular RNA breakthrough | 21-Sep-
China Science Daily | Scientists find lots of circular RNA with new methods
Shanghai Science and Technology News | Scientists reveal competition between linear RNA and circular RNA
Economic Daily | Researchers make new progress on circular RNA research
The Guardian | Scan allows scientists to determine biological age from the face alone
Live Science | Guess Your Age? 3D Facial Scan Beats Doctor's Exam
New Scientist | Eek! How your face reveals your body's real age
United Press International | Study: Face scan alone enough to calculate biological age
Time | How 3D imaging can tell exactly how old you are
Xinhua Net | Chinese scientists find aging trends on face can exactly tell biological age
People’s Daily Online | Age “written” on your face
CNR News | Chinese scientists reveal secrets: age “written” on your face
The Paper | Chinese scientists quantify aging by scanning face: 40 is dividing line
Xinhua Daily Telegraph | Chinese scientists find age “written” on face
Jiefang Daily | Face images analysis can tell how old you are
Xinmin Evening News | Is it reliable to determine age by face? - Scientists: 3D facial images can tell age
Oriental Morning Post | Face scan can tell how old you are: 40 is dividing line
Xinmin.cn | Shanghai scientists: 3D facial images can tell age
SDTV | Scientists reveal secrets: Age “written” on face
Science Daily | Age “written” on your face
Shanghai Science and Technology News | Age “written” on your face
Guokr | Face scan can tell how old you are
Youth Daily | 3D facial images can predict aging trend
Guangming Daily | Analysis of face can predict aging trend | 13-Apr-15
China Science Daily | Age “written” on your face | 30-Apr-15
Xinhua Daily Telegraph, China News Service, People’s Daily, CCTV 9 | President Xi Jinping meets with representatives of Russian experts residing in China and their relatives
Xinhua Net | Giving Thumb up to Envoys of Sino-Russian Friendship | 09-May-15
Nature “News” Column | Alzheimer’s origins tied to rise of human intelligence | 21-May-15
China Science Daily | Scientists reveal key genetic factors for evolution of Tibetans | 29-Jun-15
Shanghai Science and Technology News | Shanghai researchers discover key genetic factors for evolution of Tibetan high-altitude adaptation | 01-Jul-15
Scientific Symposium on Retrospect and Prospect of Computational Biology Research held | 06-Nov-15
China Science Daily | CAS-MPG Partner Institute for Computational Biology celebrates its 10th anniversary | 16-Nov-15
Asian Scientist | Calorie counting in roundworms increases lifespan | 22-Mar-16
Beijing Daily | Connection between you and the world beyond your imagination | 13-Jul-16
China Science Daily | Scientists unveil secrets affecting hair straightness in Asians and Europeans | 06-Sep-16
Shanghai Science and Technology News | Five questions about ancestral origins and genetic history of Tibetan highlanders | 07-Sep-16
Intellectual | Tibetan plateau: “Gene melting pot” of human population | 10-Sep-16
Cell Metabolism | Women in Metabolism | 13-Dec-16
Asian Scientist | Dogs & humans both interbred to adapt | 14-Dec-16
Scientific American | Tibetan plateau discovery shows humans may be tougher than we thought | 28-Dec-16
China Science Daily | Scientists reveal genetic origins of high-altitude adaption of Tibetan Mastiff | 29-Dec-16
Scientific American | The surprisingly early settlement of the Tibetan plateau | 01-Mar-17
China Science Daily | Computational biology for big data: new opportunities and challenges | 23-Mar-17
Shanghai Science and Technology News | New Symposium on Computational Biology for Big Data held: A series of challenges need to be addressed on the journey toward precision medicine | 23-Mar-17
Laser scanning confocal microscopy | LSM800 | 3,178,402
Light sheet | light sheet Z1 | 3,462,523
Macro microscope | AXIO ZOOM V16 | 457,180
Microfluidic Chip Platform | ABM/6/350/DCCD/M | 1,213,143
Multifunctional plant efficiency analyzer | M-PEA-2 | 220,953
Multi-Mode Microplate Readers | Synergy H1 | 384,690
Nucleofector system | AAF-1001B | 204,700
Photosynthesis measuring system | LI-6400XT | 217,000
Photosynthesis measuring system | LI-6400R | 376,540
Plant light incubator | NC350 HC-LC | 238,627
Plant light incubator | NC350 HC-LC | 238,627
Sequencer | Illumina HiSeq 2000 | 4,690,500
Spectrometer | QE65PRO | 208,300
Stable isotope analyzer | CCIA-46-EP | 1,094,379
Ultra-Sonicator | Diagenode Bioruptor Plus | 210,310
Ultraviolet lithography machine | URE2000 | 230,000


9.3.3 Office and Other Equipment 


345* Desktop and Workstation Computer | APPLE Mac mini, APPLE iMac, IBM, Lenovo, Dell | 3,835,086.50
112* Laptop | IBM, HP, APPLE | 2,157,320.00
59* Liquid Crystal Display | ACER, APPLE, LG, Samsung, HP | 199,341.00
76* Laser Printer | HP LJ-1320N, HP LJ-2015D, LJ-2055DN | 263,504.00
33* Air-condition in server room and lab | Haier KFRD-58LW/Z2, Midea KFR-SOGW/DY-T6(E2), etc. | 604,526.00
Monitor System | Only WAY Information Technology CO. LTD | 119,399.00
5* Projector | SONY, OPTOMA | 32,456.00
Video system | POLYCOM VSX7000 | 55,965.00
3* Photocopier | Canon iR2870; XEROX | 61,480.00
2* Electric Whiteboard | Panasonic, SMART SB680 | 28,300.00
6* Scanner | Canon, Epson | 15,711.00
11 Publications and Software Algorithms 


11.1 Publications (2014-2017) 


+ Weizhong Chen, Yi Liu, Shanshan Zhu, 
Christopher D. Green, Gang Wei & 
Jing-Dong Jackie Han. Improved 
nucleosome-positioning algorithm iNPS for 
accurate nucleosome positioning from 
sequencing data. Nat Commun, 2014 Sep 18;5:4909. 


+ Qingqing Zhu, Lu Song, Guangdun Peng, 
Na Sun, Jun Chen, Ting Zhang, Nengyin 
Sheng, Wei Tang, Cheng Qian, Yunbo Qiao, 
Ke Tang, Jing-Dong Jackie Han, Jinsong 
Li, Naihe Jing. The transcription factor 
Pou3f1 promotes neural fate commitment 
via activation of neural lineage genes and 
inhibition of external signaling pathways. 
eLife. Jun, 2014. 


+ He Z, Bammann H, Han D, Xie G, 


Khaitovich P. Conserved expression of 
lincRNA during human and macaque 
prefrontal cortex development and 
maturation. RNA. 2014 . 


+ Su ZD, Sheng QH, Li QR, Chi H, Jiang X, Yan 


Z, Fu N, He SM, Khaitovich P, Wu JR, Zeng R. 
De novo identification and quantification of 
single amino-acid variants in human brain. 
J Mol Cell Biol, 2014 Oct;6(5):421-33. 


+ Hu HY, He L, Khaitovich P. Deep 


sequencing reveals a novel class of 
bidirectional promoters associated with 
neuronal genes. BMC Genomics. 2014 Jun 
10;15:457. 


+ Chunyun Jiang, Jiajia Mercedes Xu, 


Changpeng Xin, Danny Tholen, Hui Zhang, 
Xin-Guang Zhu, Yanxiu Zhao. Increased 
expression of mitochondria-localized 
carbonic anhydrase activity resulted in 

an increased biomass accumulation in 
Arabidopsis thaliana. Journal of Plant Biology 
2014,57: 366-374. 


+ Jindong Sun, Zhaozhong Feng, Andrew 
Leakey, Xin-Guang Zhu, Carl J Bernacchi, 
Donald R Ort. Inconsistency of 
mesophyll conductance estimate causes 
the inconsistency for the estimates of 
maximum rate of Rubisco carboxylation 
among the linear, rectangular and 
non-rectangular hyperbola biochemical models 
of leaf photosynthesis - a case study of 
CO2 enrichment and leaf aging effects in 
soybean. Plant Science, 2014, 226: 49-60. 


+ Taiyu Chen, Xin-Guang Zhu, Yongjun 
Lin. Major alterations in transcript profiles 
between C3-C4 and C4 photosynthesis of 
an amphibious species Eleocharis baldwinii. 
Plant Molecular Biology 2014, 86: 93-110. 


+ Lin Wang, Angelika Czedik-Eysenberg, 


Rachel A. Mertz, Yaqing Si, Takayuki Tohge, 
Adriano Nunes-Nesi, Stephanie Arrivault, 
Lauren K. Dedow, Douglas W. Bryant, 

Wen Zhou, Jiajia Xu, Sarit Weissmann, 
Anthony Studer, Pinghua Li, Cankui Zhang, 
Therese LaRue, Ying Shao, Zehong Ding, 
Qi Sun, Rohan V. Patel, Robert Turgeon, 
Xin-Guang Zhu, Nicholas J. Provart, Todd 
C. Mockler, Alisdair R. Fernie, Mark Stitt, 
Peng Liu, Thomas P. Brutnell. Comparative 
analyses of C4 and C3 photosynthesis in 
developing leaves of maize and rice. Nature 
Biotechnology,2014,32: 1158-1165. 


+ Xianbin Yu, Guangyong Zheng, Lanlan 


Shan, Guofeng Meng, Martin Vingron, 
Qi Liu, Xin-Guang Zhu. Reconstruction 
of gene regulatory network related to 
photosynthesis in Arabidopsis thaliana. 
Frontiers in Plant Sciences. doi: 10.3389/ 
fpls.2014.00273. 


+ Yi-Bo Chen, Tian-Cong Lu, Hong-Xia Wang, 
Jie Shen, Tian-Tian Bu, Qing Chao, 
Zhi-Fang Gao, Xin-Guang Zhu, Yue-Feng 
Wang, Bai-Chen Wang. Posttranslational 
modification of maize chloroplast pyruvate 
orthophosphate dikinase reveals the 
precise regulatory mechanism of its 
enzyme activity. Plant Physiology, 2014,165: 
534-549. 


+ Yuanyuan Li, Jiajia Xu, Noor Ul Haq, Hui 
Zhang, Xin-Guang Zhu. Was low CO2 a 
driving force of C4 evolution? Arabidopsis 
responses to long-term low CO2 stress. 
Journal of Experimental Botany 2014,65: 
3657-67. 


+ Ding Q, Hu Y, Xu S, Wang G, Li H, Zhang 
R, Yan S, Wang J, Jin L. Neanderthal 
Origin of the Haplotypes Carrying the 
Functional Variant Val92Met in the MC1R in 
Modern Humans. Molecular Biology and 
Evolution. 2014. 31(8):1994-2003. 


+ Guo J, Tan J, Yang Y, Zhou HK, Hu S, Hashan A, 
Bahaxar N, Xu S, Weaver TD, Jin L, Stoneking 
M, Tang K. Variation and signatures of 
selection on the human face. J. Hum. 
Evol. 2014. 75:143-152. 


+ Peng L, Zhao Q, Li Q, Li M, Li C, Xu T, Jing 
X, Zhu X, Wang Y, Li F, Liu R, Zhong C, Pan 
Q, Zeng B, Liao Q, Hu B, Hu ZX, Huang 
YS, Sham P, Liu J, Xu S, Wang J, Gao 
ZL, Wang Y. The p.Ser267Phe variant in 
SLC10A1 is associated with resistance 
to chronic hepatitis B. Hepatology. 2014. 
61(4):1251-1260. 


+ Hatin WI, Nur-Shafawati AR, Etemad A, Jin 
W, Qin P, Xu S, Jin L, Tan SG, Limprasert 
P, Feisal MA, Rizman-idid M, Zilfalil 
BA; HUGO Pan-Asian SNP Consortium. 
A genome wide pattern of population 
structure and admixture in peninsular 
Malaysia Malays. The HUGO Journal. 2014. 
8:5. 


+ Yang X, Al-Bustan S, Feng Q, Guo W, Ma Z, 
Marafie M, Jacob S, Al-Mulla F and Xu S. The 
influence of admixture and consanguinity 
on population genetic diversity in Middle 
East. Journal of Human Genetics. 2014. 
59:615-622. 


+ Wang Y, Zhou Y, Li L, Chen X, Liu Y, Ma Z and 
Xu S. A new method for modeling coalescent 
processes with recombination. BMC 
Bioinformatics. 2014. 15:273. 


+ Li J, Lou H, Yang X, Lu D, Li S, Jin L, Pan 
X, Yang W, Song M, Mamatyusupu D, Xu 
S. Genetic architectures of ADME genes in 
five Eurasian admixed populations and 
implications for drug safety and efficacy. 
Journal of Medical Genetics. 2014. 51(9):614-22. 


+ Deng L, Hoh BP, Lu D, Fu R, Phipps ME, Li 


S, Nur-Shafawati AR, Hatin WI, Ismail 

E, Mokhtar SS, Jin L, Zilfalil BA, Marshall 

CR, Scherer SW, Al-Mulla F, Xu S. The 
population genomic landscape of human 
genetic structure, admixture history and 
local adaptation in Peninsular Malaysia. Hum 
Genet. 2014. 133(9):1169-1185. 


+ Li J, Lao X, Zhang C, Tian L, Lu D and 


Xu S. Increased genetic diversity of 
ADME genes in African Americans 


compared with their putative ancestral 
source populations and implications for 
Pharmacogenomics. BMC Genetics. 2014. 
15:52. 


+ Wolf S, Freier E, Cui Q, Gerwert K. Infrared 


spectral marker bands characterizing a 
transient water wire inside a hydrophobic 
membrane protein. J. Chem. Phys., 2014, 141, 
22D524. 
+ Wolf S, Freier E, Gerwert K. A Delocalized 


Proton-Binding Site within a membrane 
Protein. Biophysical Journal, 2014, 107, 
174-184 


+ Krauß SD, Petersen D, Niedieker D, Fricke 
I, Freier E, El-Mashtoly SF, Gerwert K. and 
Mosig A. Colocalization of fluorescence 
and Raman microscopic images for the 
identification of subcellular compartments: 
a validation study. Analyst, 2014, 140(7), 
pp.2360-2368 


+ Yang J, Grünewald S, Xu Y, Wan XF. 


Quartet-based methods to reconstruct 
phylogenetic networks. BMC Systems 
Biology. 2014. 8(1), 21. 


+ Gou X, Wang Z, Li N, Qiu F, Xu Z, Yan D, 


Yang S, Jia J, Kong X, Wei Z, Lu S, Lian L, 
Wu C, Wang X, Li G, Ma T, Jiang Q, Zhao 
X, Yang J, Liu B, Wei D, Li H, Yang J, Yan 
Y, Zhao G, Dong X, Li M, Deng W, Leng J, 
Wei C, Wang C, Mao H, Zhang H, Ding G, 
Li Y. Whole-genome sequencing of six 
dog breeds from continuous altitudes 


reveals adaptation to high-altitude hypoxia. 


Genome Research. 2014 Aug;24(8):1308-15. 


+ Shen B, Teschendorff AE, Zhi D, Xia J. 


Biomedical data integration, modeling, 
and simulation in the era of big data and 
translational medicine. Biomed Res Int. 
2014;2014:731546. 


+ Anjum S, Fourkala EO, Zikan M, Wong A, 


Gentry-Maharaj A, Jones A, Hardy R, Cibula 
D, Kuh D, Jacobs IJ, Teschendorff AE, 
Menon U, Widschwendter M. A BRCA1- 
mutation associated DNA methylation 
signature in blood cells predicts sporadic 
breast cancer incidence and survival. 
Genome Med. 2014 Jun 27;6(6):47. 


+ Gomez-Cabrero D, Abugessaisa I, Maier D, 


Teschendorff A, Merkenschlager M, Gisel 
A, Ballestar E, Bongcam-Rudloff E, Conesa 
A, Tegnér J. Data integration in the era of 
omics: current and future challenges. BMC 
Syst Biol. 2014;8 Suppl 2:11. 


+ Teschendorff AE, Liu X, Caren H, Pollard 


SM, Beck S, Widschwendter M, Chen L. The 
dynamics of DNA methylation covariation 
patterns in carcinogenesis. PLoS Comput 
Biol. 2014 Jul 10;10(7):e1003709. 


+ Ma Z, Teschendorff AE, Yu H, Taghia J, Guo 


J. Comparisons of non-Gaussian statistical 
models in DNA methylation analysis. Int J 
Mol Sci. 2014 Jun 16;15(6):10835-54. 


+ Jiao Y, Widschwendter M, Teschendorff 


AE. A systems-level integrative framework 
for genome-wide DNA methylation 

and gene expression data identifies 
differential gene expression modules under 
epigenetic control. Bioinformatics 2014 Aug 
15;30(16):2360-6. 


+ Steegenga WT, Boekschoten MV, Lute 


C, Hooiveld GJ, de Groot PJ, Morris TJ, 
Teschendorff AE, Butcher LM, Beck S, 
Muller M. Genome-wide age-related 
changes in DNA methylation and gene 
expression in human PBMCs. Age (Dordr). 
2014 Jun;36(3):9648. 


+ Guo J, Tan J, Yang Y, Zhou H, Hu S, Hashan A, 


Bahaxar N, Xu S, Weaver TD, Jin L, Stoneking 
M, Tang K, Variation and signatures of 
selection on the human face, J Hum 
Evol,2014, 75:143-52. 


+ Wang G-Z, Marini S, Ma X, Yang Q, Zhang 


X and Zhu Y. Improvement of Dscam 
homophilic binding affinity throughout 
Drosophila evolution. BMC Evol Bio, 2014 (1), 
186. 


+ Liu J, Wang X, Li J, Wang H, Wei G and Yan 
J. Reconstruction of the gene regulatory 
network involved in the sonic hedgehog 
pathway with a potential role in early 
development of the mouse brain. PLoS 
Computational Biology, 2014, 10: e1003884. 


+ Li Q, Zou J, Wang M, Ding X, Chepelev I, 
Zhou X, Zhao W, Wei G, Cui J, Zhao K et 
al. Critical role of histone demethylase 
Jmjd3 in the regulation of CD4+ T-cell 
differentiation. Nature Communications, 
2014, 5: 5780. 

+ Li Q, Wang HY, Chepelev I, Zhu Q, Wei G, 
Zhao K and Wang RF. Stage-dependent and 
locus-specific role of histone demethylase 
Jumonji D3 (JMJD3) in the embryonic 
stages of lung development. PLoS Genetics, 
2014, 10: e1004524. 


+ Chen W, Liu Y, Zhu S, Green CD, Wei G and 
Han JD. Improved nucleosome-positioning 
algorithm iNPS for accurate nucleosome 
positioning from sequencing data. Nature 
Communications, 2014, 5: 4909. 


+ Yang L and Chen LL. Microexons go big. 
Cell, 2014, 159: 1488-1489 (Preview) 


+ Yang L and Chen LL. Competition of RNA 
splicing; line in or circle up. Sci China Life 
Sci, 2014, 57: 1232-1233 (Review) 


+ Dong R, Chen LL and Yang L. Research 
progress of circular RNA in the post- 
genome era. Chinese J Cell Biol, 2014, 36: 
1455-1459 (Review, in Chinese) 


+ Zhang XO, Wang HB, Zhang Y, Lu X, Chen 
LL and Yang L. Complementary sequence- 
mediated exon circularization. Cell, 2014, 
159: 134-147 


+ Gerstein M, Rozowsky J, Yan KK, Wang D, 


Cheng C, Brown JB, Davis C, Hillier L, Sisu C, 
Li JJ, Pei B, Harmanci AO, Duff MO, Djebali 
S, Alexander RP, Alver BH, Auerbach R, Bell 
K, Bickel PJ, Boeck ME, Boley NP, Booth 
BW, Cherbas L, Cherbas P, Di C, Dobin A, 
Drenkow J, Ewing B, Fang G, Fastuca M, 
Feingold EA, Frankish A, Gao G, Good PJ, 
Guigo R, Hammonds A, Harrow J, Hoskins 
RA, Howald C, Hu L, Huang H, Hubbard TJP, 
Huynh C, Jha S, Kasper D, Kato M, Kaufman 
TC, Kitchen RR, Ladewig E, Lagarde J, Lai E, 
Leng J, Lu Z, MacCoss M, May G, McWhirter 
R, Merrihew G, Miller DM, Mortazavi A, 
Murad R, Oliver B, Olson S, Park PJ, Pazin 
MJ, Perrimon N, Pervouchine D, Reinke V, 


Reymond A, Robinson G, Samsonova A, 
Saunders G, Schlesinger F, Sethi A, Slack FJ, 
Spencer WC, Stoiber MH, Strasbourger P, 
Tanzer A, Thompson OA, Wan KH, Wang 

G, Wang H, Watkins KL, Wen J, Wen K, Xue 
C, Yang L, Yip K, Zaleski C, Zhang Y, Zheng 
H, Brenner SE, Graveley BR, Celniker SE, 
Gingeras TR and Waterston R. Comparative 
analysis of the transcriptome across distant 
species. Nature, 2014, 512: 445-448 


+ Zhang Y, Yang L and Chen LL. Life without 


A tail: new formats of long noncoding 
RNAs. Int J Biochem and Cell Biol, 2014, 54: 
338-349 (Review) 


+ Zhang XO, Yin QF, Chen LL and Yang 


L. Gene expression profiling of non- 
polyadenylated RNA-seq across species. 
Genomics Data, 2014, 2: 237-241 


+ Zhang XO, Yin QF, Wang HB, Zhang Y, 


Chen T, Zheng P, Lu X, Chen LL and Yang 
L. Species-specific alternative splicing leads 
to unique expression of sno-lncRNAs. BMC 
Genomics, 2014, 15: 287 (Highly Accessed) 
+ Xiang JF, Yin QF, Chen T, Zhang Y, Zhang 


XO, Wang HB, Ge JH, Lu XH, Yang L and 
Chen LL. Human colorectal cancer-specific 
CCAT1-L lncRNA regulates long-range 
chromatin interactions in the MYC locus. 
Cell Res, 2014, 24:513-531 (Cover Article and 
Issue Highlight) 


+ Wei H, Wang Z. Engineering RNA-binding 


proteins with diverse activities. Wiley 
interdisciplinary reviews. RNA. 2015; 6(6):597- 
613. PMID: 26329122 


+ Szempruch AJ, Choudhury R, Wang Z, 


Hajduk SL. In vivo analysis of trypanosome 
mitochondrial RNA function by artificial 
site-specific RNA endonuclease-mediated 
knockdown. RNA. 2015; 21(10):1781-9. PMID: 
26264591, PMCID: PMC4574754 


+ Wang Z. Not just a sponge: new functions 


of circular RNAs discovered. Science China. 
Life sciences. 2015; 58(4):407-8. PMID: 
25680857 


+ Tsai YS, Dominguez D, Gomez SM, Wang 


Z. Transcriptome-wide identification and 
study of cancer-specific splicing events 
across multiple tumors. Oncotarget. 2015; 
6(9):6825-39. PMID: 25749525, PMCID: 
PMC4466652 


+ Wang Y, Wang Z. Efficient backsplicing 


produces translatable circular mRNAs. RNA. 
2015; 21(2):172-9. PMID: 25449546, PMCID: 
PMC4338345 


+ Han Zhang, Hao Cheng, Qingqing Wang, 


Xianping Zeng, Yanfen Chen, Jin Yan, 
Yanran Sun, Xiaoxi Zhao, Weijing Li, Chao 
Gao, Wenyu Gong, Bei Li, Ruidong Zhang, 
Li Nan, Young Wu, Shilai Bao, Jing-Dong 

J Han and Huyong Zheng. An advanced 
fragment analysis-based individualized 


subtype classification of pediatric acute 
Algorithm/ 


Jing-Dong 
Jackie Han 


Philipp 
Khaitovich 


Klaus Gerwert 


Haipeng Li 


Zhen Shao 


iTranscriptome | A web portal to provide open access of the spatial transcriptome data of mouse E7.0 embryo | DEV CELL, 2016
eDM_BN | A Systems Approach to Reverse Engineer Lifespan Extension by Dietary Restriction | CELL METAB, 2016
Facial aging predictor | A method to find patterns of ageing based on certain facial features | CELL RES, 2015
iNPS | An improved algorithm for accurate nucleosome positioning from sequencing data | NAT COMMUN, 2014
Repeats analysis pipeline | NGS data analysis for repetitive DNA elements | CELL REP, 2014
brainDeconv | A transcriptome deconvolution based procedure to analyze brain RNA-seq data by discriminating changes due to cell type composition changes and those which are independent from composition differences | SCI REP-UK, 2017
Primate Brain Laminar Transcriptome Database | The transcriptome of cortical layers in brains of humans, chimpanzees and rhesus macaques. The downloadable contents are available while the web interface is under reconstruction | NAT NEUROSCI, 2017
SPECREG | Fully automated registration of vibrational microspectroscopic images in histologically stained tissue sections | BMC BIOINFORMATICS, 2015
Omnisphero | High-content image analysis (HCA) approach for phenotypic developmental neurotoxicity (DNT) screenings of organoid neurosphere cultures in vitro | ARCH TOXICOL, 2017
FRColoc | Colocalization of fluorescence and Raman microscopic images for the identification of subcellular compartments | ANALYST, 2015
Local Mode Analysis | A method to decode IR spectra by visualizing molecular detail through QM/MM simulations | J PHYS CHEM B, 2017
FastEPRR | Fast estimation of population recombination rates in the genomic era | G3-GENES GENOM GENET, 2016
MAP | A statistical model to compare isotope labeling based quantitative proteomic data and identify proteins with significant abundance changes | NAT CELL BIOL, 2017
Algorithm/ 


Andrew 
Teschendorff 


Guangzhong 
Wang 


Li Yang 


FEM | Inference of Functional Epigenetic Modules | BIOINFORMATICS, 2014
DART | Denoising Algorithm using Relevance network Topology for inferring signaling pathway activity in expression profiles | GENOME BIOL, 2015
iEVORA | Epigenetic Variable Outliers for cancer Risk prediction Analysis (Feature selection framework for identifying cancer risk DNA methylation markers) | NAT COMMUN, 2016
ChAMP | Comprehensive Analysis Methylation Pipeline for Illumina beadarray data | BIOINFORMATICS, 2014
dbDEMC 2.0 | dbDEMC 2.0: updated database of differentially expressed miRNAs in human cancers | NUCLEIC ACIDS RES, 2017
EpiDISH | Epigenetic Dissection of Intra-Sample cellular Heterogeneity | BMC BIOINFORMATICS, 2017
SCENT | Single Cell Entropy for quantifying differentiation potency of single cells | NAT COMMUN, 2017
Dscam1 Web Server | Online prediction of Dscam1 self- and hetero-affinity | BIOINFORMATICS, 2017
CIRCexplorer2 | A toolset for circular RNA identification and characterization | GENOME RES, 2016
CIRCpedia | An integrative database, aiming to annotate alternative back-splicing and alternative splicing in circRNAs across different cell lines | GENOME RES, 2016
CIRCpseudo | A pipeline to map back-splicing junction sequences for circRNA-derived pseudogenes detection | CELL RES, 2016
TERate | A computational pipeline to measure transcription elongation rates (TERs) with 4SUDRB-Seq | CELL REP, 2016
CSI | A pipeline to quantitate RNA pairing capacity of orientation-opposite complementary sequences across circRNA-flanking introns | RNA BIOL, 2016
CIRCexplorer | A combined strategy to identify circular RNAs (circRNAs and ciRNAs) | CELL, 2014